Re: Writing to HDFS and cluster utilization

2018-06-15 Thread Rohit Karlupia
Hi,


The minimal solution is to enable dynamic allocation and set the idle timeout to a
low value. This will ensure that idle executors are killed and their resources
become available for others to use:


spark.dynamicAllocation.enabled

spark.dynamicAllocation.executorIdleTimeout
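
For example, a minimal sketch of setting this at session startup (the 30s timeout,
the min/max executor bounds, and the app name are illustrative values, and on YARN
the external shuffle service must also be enabled for dynamic allocation):

    import org.apache.spark.sql.SparkSession

    // Assumes a YARN-style cluster with the external shuffle service available.
    val spark = SparkSession.builder()
      .appName("my-job")                                          // hypothetical app name
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
      .config("spark.dynamicAllocation.minExecutors", "20")
      .config("spark.dynamicAllocation.maxExecutors", "200")
      .config("spark.shuffle.service.enabled", "true")            // required for dynamic allocation on YARN
      .getOrCreate()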

If you would like to understand "wastage" vs. "completion time" with
different executor counts, try Sparklens: https://github.com/qubole/sparklens
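
Sparklens is attached as an extra Spark listener; a rough sketch follows. The
package coordinates and listener class name are taken from memory of the
Sparklens README, so please verify them there before use:

    import org.apache.spark.sql.SparkSession

    // The Sparklens jar must also be on the classpath, e.g. submitted with
    //   --packages qubole:sparklens:0.3.2-s_2.11
    val spark = SparkSession.builder()
      .appName("my-job")   // hypothetical app name
      .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
      .getOrCreate()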

thanks,
rohitk

On Fri, Jun 15, 2018 at 6:32 PM, Alessandro Liparoti <
alessandro.l...@gmail.com> wrote:

> Hi,
>
> I would like to briefly present my use case and gather possible useful
> suggestions from the community. I am developing a Spark job that massively
> reads from and writes to Hive. Usually, I use 200 executors with 12g memory
> each and a parallelism level of 600. The main run of the application
> consists of several phases: read from HDFS, persist, small and simple
> aggregations, write to HDFS. These steps are repeated a certain number of times.
> When I write to Hive, I aim for partitions of approximately 50-70 MB,
> so I repartition before writing into approximately 15 parts
> (according to the data size). The writing phase takes around 1.5 minutes;
> this means that for 1.5 minutes only 15 out of 600 possible active tasks
> are running in parallel. This looks like a big waste of resources. How would you
> solve this problem?
>
> I am trying to experiment with the FAIR scheduler and job pools, but it
> does not seem to improve things much; for some reason, I cannot have more
> than 4 parallel jobs running. I am investigating this right now, and maybe
> I will provide more details about it afterwards.
>
> I would like to know whether this use case is normal, what you would do,
> and whether in your opinion I am doing something wrong.
>
> Thanks,
> *Alessandro Liparoti*
>


Re: time for Apache Spark 3.0?

2018-06-15 Thread Reynold Xin
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
wrote:

> I agree, I don't see a pressing need for a major version bump either.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
> >>
> >> Dear all:
> >>
> >> It has been 2 months since this topic was proposed. Any progress
> now? 2018 is about half over.
> >>
> >> I agree that the new version should bring some exciting new features.
> How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for a good open source project. It
> would be better to shed the historical burden and focus on new areas.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache Spark
> not yet supporting Scala 2.12, and that got me to think perhaps it is about
> time for Spark to work towards the 3.0 release. By the time it comes out,
> it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to give
> more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016.
> If we were to maintain the ~ 2 year cadence, it is time to work on Spark
> 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
> >>>
> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
> >>>
> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temporary view”).
> >>>
> >>>
> >>> Now the reality of a major version bump is that the world often thinks
> in terms of what exciting features are coming. I do think there are a
> number of major changes happening already that can be part of the 3.0
> release, if they make it in:
> >>>
> >>> 1. Scala 2.12 support (listing it twice)
> >>> 2. Continuous Processing non-experimental
> >>> 3. Kubernetes support non-experimental
> >>> 4. A more fleshed out version of the data source API v2 (I don’t think
> it is realistic to stabilize that in one release)
> >>> 5. Hadoop 3.0 support
> >>> 6. ...
> >>>
> >>>
> >>>
> >>> Similar to the 2.0 discussion, this thread should focus on the
> framework and whether it’d make sense to create Spark 3.0 as the next
> release, rather than the individual feature requests. Those are important
> but are best done in their own separate threads.

Re: Time for 2.1.3

2018-06-15 Thread Wenchen Fan
+1

On Fri, Jun 15, 2018 at 7:10 AM, Tom Graves 
wrote:

> +1 for doing a 2.1.3 release.
>
> Tom
>
> On Wednesday, June 13, 2018, 7:28:26 AM CDT, Marco Gaido <
> marcogaid...@gmail.com> wrote:
>
>
> Yes, you're right Herman. Sorry, my bad.
>
> Thanks.
> Marco
>
> 2018-06-13 14:01 GMT+02:00 Herman van Hövell tot Westerflier <
> her...@databricks.com>:
>
> Isn't this only a problem with Spark 2.3.x?
>
> On Wed, Jun 13, 2018 at 1:57 PM Marco Gaido 
> wrote:
>
> Hi Marcelo,
>
> thanks for bringing this up. Maybe we should consider including
> SPARK-24495, as it is causing some queries to return an incorrect result.
> What do you think?
>
> Thanks,
> Marco
>
> 2018-06-13 1:27 GMT+02:00 Marcelo Vanzin :
>
> Hey all,
>
> There are some fixes that went into 2.1.3 recently that probably
> deserve a release. So as usual, please take a look if there's anything
> else you'd like on that release, otherwise I'd like to start with the
> process by early next week.
>
> I'll go through jira to see what's the status of things targeted at
> that release, but last I checked there wasn't anything on the radar.
>
> Thanks!
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 
>
>
>
>


Unsubscribe

2018-06-15 Thread Mikhail Dubkov
Unsubscribe

On Thu, Jun 14, 2018 at 8:38 PM Kumar S, Sajive 
wrote:

> Unsubscribe
>


Re: time for Apache Spark 3.0?

2018-06-15 Thread Mridul Muralidharan
I agree, I don't see a pressing need for a major version bump either.


Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra  wrote:
>
> Changing major version numbers is not about new features or a vague notion 
> that it is time to do something that will be seen to be a significant 
> release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>>
>> Dear all:
>>
>> It has been 2 months since this topic was proposed. Any progress now?
>> 2018 is about half over.
>>
>> I agree that the new version should bring some exciting new features. How
>> about this one:
>>
>> 6. ML/DL framework to be integrated as core component and feature. (Such as 
>> Angel / BigDL / ……)
>>
>> 3.0 is a very important version for a good open source project. It would
>> be better to shed the historical burden and focus on new areas. Spark
>> has been widely used all over the world as a successful big data framework.
>> And it can be better than that.
>>
>> Andy
>>
>>
>> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
>>>
>>> There was a discussion thread on scala-contributors about Apache Spark not 
>>> yet supporting Scala 2.12, and that got me to think perhaps it is about 
>>> time for Spark to work towards the 3.0 release. By the time it comes out, 
>>> it will be more than 2 years since Spark 2.0.
>>>
>>> For contributors less familiar with Spark’s history, I want to give more 
>>> context on Spark releases:
>>>
>>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If 
>>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 
>>> in 2018.
>>>
>>> 2. Spark’s versioning policy promises that Spark does not break stable APIs 
>>> in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
>>> necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 
>>> 3.0).
>>>
>>> 3. That said, a major version isn’t necessarily the playground for 
>>> disruptive API changes to make it painful for users to update. The main 
>>> purpose of a major release is an opportunity to fix things that are broken 
>>> in the current API and remove certain deprecated APIs.
>>>
>>> 4. Spark as a project has a culture of evolving architecture and developing 
>>> major new features incrementally, so major releases are not the only time 
>>> for exciting new features. For example, the bulk of the work in the move 
>>> towards the DataFrame API was done in Spark 1.3, and Continuous Processing 
>>> was introduced in Spark 2.3. Both were feature releases rather than major 
>>> releases.
>>>
>>>
>>> You can find more background in the thread discussing Spark 2.0: 
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>
>>>
>>> The primary motivating factor IMO for a major version bump is to support 
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
>>> Similar to Spark 2.0, I think there are also opportunities for other 
>>> changes that we know have been biting us for a long time but can’t be 
>>> changed in feature releases (to be clear, I’m actually not sure they are 
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>> 1. Support Scala 2.12.
>>>
>>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 
>>> 2.x.
>>>
>>> 3. Shade all dependencies.
>>>
>>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, 
>>> to prevent users from shooting themselves in the foot, e.g. “SELECT 2 
>>> SECOND” -- is “SECOND” an interval unit or an alias? To make it less 
>>> painful for users to upgrade here, I’d suggest creating a flag for backward 
>>> compatibility mode.
>>>
>>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard 
>>> compliant, and have a flag for backward compatibility.
>>>
>>> 6. Miscellaneous other small changes documented in JIRA already (e.g. 
>>> “JavaPairRDD flatMapValues requires function returning Iterable, not 
>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>
>>>
>>> Now the reality of a major version bump is that the world often thinks in 
>>> terms of what exciting features are coming. I do think there are a number 
>>> of major changes happening already that can be part of the 3.0 release, if 
>>> they make it in:
>>>
>>> 1. Scala 2.12 support (listing it twice)
>>> 2. Continuous Processing non-experimental
>>> 3. Kubernetes support non-experimental
>>> 4. A more fleshed out version of the data source API v2 (I don’t think it
>>> is realistic to stabilize that in one release)
>>> 5. Hadoop 3.0 support
>>> 6. ...
>>>
>>>
>>>
>>> Similar to the 2.0 discussion, this thread should focus on the framework 
>>> and whether it’d make sense to create Spark 3.0 as the next release, rather 
>>> than the individual feature requests. Those are important but are best
>>> done in their own separate threads.

Re: time for Apache Spark 3.0?

2018-06-15 Thread Mark Hamstra
Changing major version numbers is not about new features or a vague notion
that it is time to do something that will be seen to be a significant
release. It is about breaking stable public APIs.

I still remain unconvinced that the next version can't be 2.4.0.

On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:

> *Dear all:*
>
> It has been 2 months since this topic was proposed. Any progress now?
> 2018 is about half over.
>
> I agree that the new version should bring some exciting new features. How
> about this one:
>
> *6. ML/DL framework to be integrated as core component and feature. (Such
> as Angel / BigDL / ……)*
>
> 3.0 is a very important version for a good open source project. It would
> be better to shed the historical burden and *focus on new areas*.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
>
>
> *Andy*
>
>
> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
>
>> There was a discussion thread on scala-contributors
>> 
>> about Apache Spark not yet supporting Scala 2.12, and that got me to think
>> perhaps it is about time for Spark to work towards the 3.0 release. By the
>> time it comes out, it will be more than 2 years since Spark 2.0.
>>
>> For contributors less familiar with Spark’s history, I want to give more
>> context on Spark releases:
>>
>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
>> in 2018.
>>
>> 2. Spark’s versioning policy promises that Spark does not break stable
>> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>>
>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>>
>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>>
>>
>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>
>>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>> 1. Support Scala 2.12.
>>
>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>>
>> 3. Shade all dependencies.
>>
>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>>
>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>>
>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>>
>>
>> Now the reality of a major version bump is that the world often thinks in
>> terms of what exciting features are coming. I do think there are a number
>> of major changes happening already that can be part of the 3.0 release, if
>> they make it in:
>>
>> 1. Scala 2.12 support (listing it twice)
>> 2. Continuous Processing non-experimental
>> 3. Kubernetes support non-experimental
>> 4. A more fleshed out version of the data source API v2 (I don’t think it
>> is realistic to stabilize that in one release)
>> 5. Hadoop 3.0 support
>> 6. ...
>>
>>
>>
>> Similar to the 2.0 discussion, this thread should focus on the framework
>> and whether it’d make sense to create Spark 3.0 as the next release, rather
>> than the individual feature requests. Those are important but are best done
>> in their own separate threads.
>>
>>
>>
>>
>>


Re: Time for 2.1.3

2018-06-15 Thread Tom Graves
+1 for doing a 2.1.3 release.

Tom

On Wednesday, June 13, 2018, 7:28:26 AM CDT, Marco Gaido  wrote:

Yes, you're right Herman. Sorry, my bad.

Thanks.
Marco

2018-06-13 14:01 GMT+02:00 Herman van Hövell tot Westerflier :

Isn't this only a problem with Spark 2.3.x?
On Wed, Jun 13, 2018 at 1:57 PM Marco Gaido  wrote:

Hi Marcelo,

thanks for bringing this up. Maybe we should consider including SPARK-24495,
as it is causing some queries to return an incorrect result. What do you think?

Thanks,
Marco
2018-06-13 1:27 GMT+02:00 Marcelo Vanzin :

Hey all,

There are some fixes that went into 2.1.3 recently that probably
deserve a release. So as usual, please take a look if there's anything
else you'd like on that release, otherwise I'd like to start with the
process by early next week.

I'll go through jira to see what's the status of things targeted at
that release, but last I checked there wasn't anything on the radar.

Thanks!

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org






  

Writing to HDFS and cluster utilization

2018-06-15 Thread Alessandro Liparoti
Hi,

I would like to briefly present my use case and gather possible useful
suggestions from the community. I am developing a Spark job that massively
reads from and writes to Hive. Usually, I use 200 executors with 12g memory
each and a parallelism level of 600. The main run of the application
consists of several phases: read from HDFS, persist, small and simple
aggregations, write to HDFS. These steps are repeated a certain number of times.
When I write to Hive, I aim for partitions of approximately 50-70 MB,
so I repartition before writing into approximately 15 parts
(according to the data size). The writing phase takes around 1.5 minutes;
this means that for 1.5 minutes only 15 out of 600 possible active tasks
are running in parallel. This looks like a big waste of resources. How would you
solve this problem?
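
For reference, the read/aggregate/write cycle described above amounts to roughly
the sketch below; apart from the repartition(15) before the write, everything
(table names, the aggregation, the existing `spark` session) is a hypothetical
placeholder:

    // Assumes an existing SparkSession `spark`; table and column names are placeholders.
    val source = spark.table("db.source_table").persist()      // read + persist phase

    val aggregated = source.groupBy("key").count()              // stand-in for the small aggregations

    aggregated
      .repartition(15)                                          // target ~50-70 MB output files
      .write
      .mode("overwrite")
      .saveAsTable("db.output_table")                           // only 15 tasks run during this write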

I am trying to experiment with the FAIR scheduler and job pools, but it
does not seem to improve things much; for some reason, I cannot have more than 4
parallel jobs running. I am investigating this right now, and maybe
I will provide more details about it afterwards.
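
A sketch of that FAIR-scheduler experiment, for the record; the pool name,
allocation file path, and table names are hypothetical, and jobs only run
concurrently if they are submitted from separate threads:

    // Submit with:
    //   --conf spark.scheduler.mode=FAIR
    //   --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Assumes an existing SparkSession `spark`; each write job runs in its own thread.
    val writes = Seq("db.out_a", "db.out_b", "db.out_c").map { table =>
      Future {
        // The scheduler pool is a thread-local property, so set it inside the job's thread.
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", "writes")
        spark.table("db.source_table")
          .repartition(15)
          .write.mode("overwrite")
          .saveAsTable(table)
      }
    }
    Await.result(Future.sequence(writes), Duration.Inf)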

I would like to know whether this use case is normal, what you would do,
and whether in your opinion I am doing something wrong.

Thanks,
*Alessandro Liparoti*


Re: Very slow complex type column reads from parquet

2018-06-15 Thread Jakub Wozniak
Hello,

I’m sorry to bother you again but it is quite important for us to understand 
the problem better.

One more finding in our problem is that the performance of queries on a
timestamp-sorted file depends a lot on the predicate timestamp.
If you are lucky enough to hit records from the start of the row group, it might
be fast (seconds). If you search for something that is at the end of the
row group, the query takes minutes.
I guess this is because it has to scan all the previous records in the row group
until it finds the right ones at the end.

Now I have a couple of questions regarding the way the Parquet file is read.

1) Does it always decode the query columns (the projection from the select) even if the
predicate column does not match (to me it looks like it does, but I might be wrong)?
2) Sorting the file will result in "indexed" row groups, so it will be easier to
locate which row group to scan, but isn't it at the same time limiting
parallelism? If the data is randomly placed in the row groups, it will be
searched with as many tasks as we have row groups, right (or at least more than
one)? Is there any common rule we can formulate, or is it very data- and/or
query-dependent?
3) Would making a row group smaller (say by half) help? Currently I can see that
the row groups are about the size of the HDFS block (256 MB), but sometimes
smaller or even bigger.
We have no settings for the row group, so I guess the default HDFS block size is
used, right?
Do you have any recommendation / experience with that?
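
For question 3, a minimal sketch of shrinking the row group size at write time;
"parquet.block.size" is the standard parquet-mr property read from the Hadoop
configuration, while the 128 MB value, the paths, and the `spark` session are
placeholders:

    // Assumes an existing SparkSession `spark`.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 128 * 1024 * 1024)   // illustrative target row group size

    spark.read.parquet("/path/to/input")
      .write
      .parquet("/path/to/output-with-smaller-row-groups")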

Thanks a lot for your help,
Jakub



On 14 Jun 2018, at 12:07, Jakub Wozniak  wrote:

Dear Ryan,

Thanks a lot for your answer.
After having sent the e-mail, we investigated the data itself a bit more.
It turned out that for certain days it was very skewed and one of the row groups
had many more records than all the others.
This was related to the fact that we had sorted it using our object ids,
and by chance the objects that came first were smaller (or compressed better).
So the Parquet file had 6 row groups where the first one had 300k rows and the
others only 30k rows.
The search for a given object fell into the first row group and took a very
long time.
The data itself compressed very well, as it contained a lot of zeros. To
give some numbers, the 600 MB Parquet file expanded to 56 GB in JSON.

What we did was to sort the data not by object id but by the record timestamp,
which resulted in a much more even data distribution among the row groups.
This in fact helped a lot with the query time (querying by timestamp & object id).
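
For illustration, the re-sort described here is essentially the following sketch
(paths are placeholders, `spark` is an existing SparkSession, and a global sort is
only one way to even out the row groups):

    spark.read.parquet("/path/to/skewed-data")
      .sort("timestamp")                 // sort by record timestamp -> more even row groups
      .write
      .parquet("/path/to/resorted-data")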

I have to say that I haven't fully understood this phenomenon yet, as I'm not a
Parquet format & reader expert (at least not yet).
Maybe it is just a simple function of how many records Spark has to scan and
the level of parallelism (searching for a given object id when sorted by time
needs to scan all/more of the row groups for later timestamps).
One question here: is the Parquet reader reading & decoding the projection columns
even if the predicate columns should filter the record out?

Unfortunately we have to have those big columns in the query as people want to 
do analysis on them.

We will continue to investigate…

Cheers,
Jakub



On 12 Jun 2018, at 22:51, Ryan Blue  wrote:

Jakub,

You're right that Spark currently doesn't use the vectorized read path for 
nested data, but I'm not sure that's the problem here. With 50k elements in the 
f1 array, it could easily be that you're getting the significant speed-up from 
not reading or materializing that column. The non-vectorized path is slower, 
but it is more likely that the problem is the data if it is that much slower.

I'd be happy to see vectorization for nested Parquet data move forward, but I 
think you might want to get an idea of how much it will help before you move 
forward with it. Can you use Impala to test whether vectorization would help 
here?

rb



On Mon, Jun 11, 2018 at 6:16 AM, Jakub Wozniak  wrote:
Hello,

We have stumbled upon quite degraded performance when reading complex
(struct, array) type columns stored in Parquet.
The Parquet file is around 600 MB (snappy) with ~400k rows, with a field of a
complex type { f1: array of ints, f2: array of ints } where the f1 array length is
50k elements.
There are also other fields like entity_id: long, timestamp: long.

A simple query that selects rows using the predicates entity_id = X and timestamp
>= T1 and timestamp <= T2, plus ds.show(), takes 17 minutes to execute.
If we remove the complex type columns from the query, it executes in
sub-second time.
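
For concreteness, the query shape in question is roughly the sketch below; the
path and the literal values standing in for X, T1, and T2 are placeholders, and
`spark` is an existing SparkSession:

    import spark.implicits._

    val ds = spark.read.parquet("/path/to/data")
      .where($"entity_id" === 42L &&
             $"timestamp" >= 1528963200000L && $"timestamp" <= 1529049600000L)

    ds.show()                     // ~17 minutes with the complex columns f1/f2 included
    ds.drop("f1", "f2").show()    // sub-second once the complex columns are pruned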

Now, looking at the implementation of the Parquet datasource, the
Vectorized* classes are used only if the read types are primitives. Otherwise
the code falls back to the default parquet-mr implementation.
In the VectorizedParquetRecordReader there is a TODO to handle complex types 
that "should be 

Re: time for Apache Spark 3.0?

2018-06-15 Thread Andy
*Dear all:*

It has been 2 months since this topic was proposed. Any progress now?
2018 is about half over.

I agree that the new version should bring some exciting new features. How
about this one:

*6. ML/DL framework to be integrated as core component and feature. (Such
as Angel / BigDL / ……)*

3.0 is a very important version for a good open source project. It would
be better to shed the historical burden and *focus on new areas*.
Spark has been widely used all over the world as a successful big data
framework. And it can be better than that.


*Andy*


On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:

> There was a discussion thread on scala-contributors
> 
> about Apache Spark not yet supporting Scala 2.12, and that got me to think
> perhaps it is about time for Spark to work towards the 3.0 release. By the
> time it comes out, it will be more than 2 years since Spark 2.0.
>
> For contributors less familiar with Spark’s history, I want to give more
> context on Spark releases:
>
> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
> in 2018.
>
> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
>
> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
>
> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
>
>
> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>
>
> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
>
> 1. Support Scala 2.12.
>
> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
>
> 3. Shade all dependencies.
>
> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
>
> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
>
> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temporary view”).
>
>
> Now the reality of a major version bump is that the world often thinks in
> terms of what exciting features are coming. I do think there are a number
> of major changes happening already that can be part of the 3.0 release, if
> they make it in:
>
> 1. Scala 2.12 support (listing it twice)
> 2. Continuous Processing non-experimental
> 3. Kubernetes support non-experimental
> 4. A more fleshed out version of the data source API v2 (I don’t think it
> is realistic to stabilize that in one release)
> 5. Hadoop 3.0 support
> 6. ...
>
>
>
> Similar to the 2.0 discussion, this thread should focus on the framework
> and whether it’d make sense to create Spark 3.0 as the next release, rather
> than the individual feature requests. Those are important but are best done
> in their own separate threads.
>
>
>
>
>


Re: Re: Support SqlStreaming in spark

2018-06-15 Thread Hadrien Chicault
Unsubscribe

2018-06-15 9:20 GMT+02:00 stc :

> The repo you linked may solve some of the SqlStreaming problems, but it is not
> friendly enough; users need to learn this new syntax.
>
> --
> Jacky Lee
> Mail:qcsd2...@163.com
>
> At 2018-06-15 11:48:01, "Bowden, Chris" 
> wrote:
>
> Not sure if there is a question in here, but if you are hinting that
> structured streaming should support a sql interface, spark has appropriate
> extensibility hooks to make it possible. However, the most powerful
> construct in structured streaming is quite difficult to find a sql
> equivalent for (e.g., flatMapGroupsWithState). This repo could use some
> cleanup but is an example of providing a sql interface to a subset of
> structured streaming's functionality:
> https://github.com/vertica/pstl/blob/master/pstl/src/main/antlr4/org/apache/spark/sql/catalyst/parser/pstl/PstlSqlBase.g4.
>
> --
> *From:* JackyLee 
> *Sent:* Thursday, June 14, 2018 7:06:17 PM
> *To:* dev@spark.apache.org
> *Subject:* Support SqlStreaming in spark
>
> Hello
>
> Nowadays, more and more streaming products have begun to support SQL streaming,
> such as KafkaSQL, Flink SQL and Storm SQL. Supporting SQL streaming can not
> only lower the entry barrier of streaming, but also make streaming easier
> for everyone to adopt.
>
> At present, Structured Streaming is relatively mature, and it is
> based on the Dataset API, which makes it possible to provide a SQL portal for
> Structured Streaming and run it in SQL.
>
> To support SQL streaming, there are two key points:
> 1. The analysis phase should be able to parse streaming-type SQL.
> 2. The analyzer should be able to map metadata information to the corresponding
> Relation.
>
> Running Structured Streaming in SQL can bring some benefits:
> 1. It lowers the entry barrier of Structured Streaming and attracts users more
> easily.
> 2. It encapsulates the metadata of a source or sink into a table, maintained
> and managed uniformly, which makes it more accessible to users.
> 3. Metadata permission management, which is based on Hive, can control
> Structured Streaming's overall authorization scheme more closely.
>
> We have found some ways to solve this problem. It's a pleasure to discuss it
> with you.
>
> Thanks,
>
> Jackey Lee
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re:Re: Support SqlStreaming in spark

2018-06-15 Thread stc
The repo you linked may solve some of the SqlStreaming problems, but it is not
friendly enough; users need to learn this new syntax.


--

Jacky Lee
Mail:qcsd2...@163.com


At 2018-06-15 11:48:01, "Bowden, Chris"  wrote:


Not sure if there is a question in here, but if you are hinting that structured 
streaming should support a sql interface, spark has appropriate extensibility 
hooks to make it possible. However, the most powerful construct in structured 
streaming is quite difficult to find a sql equivalent for (e.g., 
flatMapGroupsWithState). This repo could use some cleanup but is an example of 
providing a sql interface to a subset of structured streaming's functionality: 
https://github.com/vertica/pstl/blob/master/pstl/src/main/antlr4/org/apache/spark/sql/catalyst/parser/pstl/PstlSqlBase.g4.
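
For reference, a rough sketch (in plain Spark, independent of the repo above) of
what a SQL-only entry point over Structured Streaming can already express through
temporary views; the rate source, view name, and trigger below are illustrative,
and stateful constructs such as flatMapGroupsWithState are exactly what this style
does not cover:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("sql-over-structured-streaming").getOrCreate()

    // Register a streaming DataFrame as a temporary view so it can be queried with plain SQL.
    spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()
      .createOrReplaceTempView("events")

    // The SQL text is planned as a normal Structured Streaming aggregation.
    val counts = spark.sql(
      "SELECT value % 10 AS bucket, COUNT(*) AS cnt FROM events GROUP BY value % 10")

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()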



From: JackyLee 
Sent: Thursday, June 14, 2018 7:06:17 PM
To: dev@spark.apache.org
Subject: Support SqlStreaming in spark
 
Hello

Nowadays, more and more streaming products have begun to support SQL streaming,
such as KafkaSQL, Flink SQL and Storm SQL. Supporting SQL streaming can not
only lower the entry barrier of streaming, but also make streaming easier
for everyone to adopt.

At present, Structured Streaming is relatively mature, and it is
based on the Dataset API, which makes it possible to provide a SQL portal for
Structured Streaming and run it in SQL.

To support SQL streaming, there are two key points:
1. The analysis phase should be able to parse streaming-type SQL.
2. The analyzer should be able to map metadata information to the corresponding
Relation.

Running Structured Streaming in SQL can bring some benefits:
1. It lowers the entry barrier of Structured Streaming and attracts users more
easily.
2. It encapsulates the metadata of a source or sink into a table, maintained
and managed uniformly, which makes it more accessible to users.
3. Metadata permission management, which is based on Hive, can control
Structured Streaming's overall authorization scheme more closely.

We have found some ways to solve this problem. It's a pleasure to discuss it
with you.

Thanks, 

Jackey Lee



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org