RE: [DISCUSS] Spark 2.5 release

2019-09-25 Thread JOAQUIN GUANTER GONZALBEZ
I’ll chime in as an actual implementor of a custom DataSource who is keeping an 
eye on the 3.0 DSv2 changes.

We started implementing DSv2 in the 2.4 branch, but quickly discovered that the 
DSv2 in 3.0 was a complete breaking change (to the point where it could have 
been named DSv3 and it wouldn’t have come as a surprise). Since the DSv2 in 3.0 
has a compatibility layer for DSv1 datasources, we decided to fall back to 
DSv1 in order to ease the future transition to Spark 3.

From my point of view, a Spark 2.5 release with a backport of DSv2 _which does 
not remove the old 2.4 DSv2 classes_ would be ideal, since it would work as a 
stepping stone both for current users of DSv1 and for users of the 2.4 DSv2 classes.

I agree with Xiao that it is likely that the 3.0 DSv2 classes will need to 
incorporate feedback from the community once people start using them. I hope we 
aren’t planning on marking them as Stable as soon as Spark 3.0 is released! 
They don’t seem to have any InterfaceStability marker at the moment in master.
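
For reference, this is the kind of marker I mean. A minimal sketch, assuming 
the annotation shipped in the 2.x source tree 
(org.apache.spark.annotation.InterfaceStability); the trait here is a made-up 
example of mine, not real Spark code:

    import org.apache.spark.annotation.InterfaceStability

    // Flags an API as likely to change between releases, so implementors
    // like us know not to depend on source or binary compatibility yet.
    @InterfaceStability.Evolving
    trait MyCustomCatalogApi {
      def name: String
    }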

Cheers,
Ximo

From: Ryan Blue
Sent: Wednesday, September 25, 2019 0:54
To: Jungtaek Lim
CC: Dongjoon Hyun; Holden Karau; Hyukjin Kwon; Marco Gaido; Matei Zaharia; 
Reynold Xin; Spark Dev List
Subject: Re: [DISCUSS] Spark 2.5 release

> That's not a new requirement, that's an "implicit" requirement via semantic 
> versioning.

The expectation is that the DSv2 API will change in minor versions in the 2.x 
line. The API is marked with the Experimental API annotation to signal that it 
can change, and it has been changing.

A requirement to not change this API for a 2.5 release is a new requirement. 
I'm fine with that if that's what everyone wants. Like I said, if we want to 
add a requirement to not change this API then we shouldn't release the 2.5 that 
I'm proposing.

On Tue, Sep 24, 2019 at 2:51 PM Jungtaek Lim 
<kabh...@gmail.com> wrote:
>> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.

> This has not been a requirement for DSv2 development so far. If this is a new 
> requirement, then we should not do a 2.5 release.

My 2 cents: the target version for the new DSv2 has only ever been 3.0, so we 
never had a chance to think about such a requirement; that's why there's no 
restriction on breaking compatibility in the codebase. That's not a new 
requirement, that's an "implicit" requirement via semantic versioning. I agree 
that some APIs have changed between Spark 2.x versions, but I guess the changes 
in the "new" DSv2 would be bigger than the sum of the changes to the "old" DSv2 
introduced across multiple minor versions.

Suppose we're developers in the Spark ecosystem maintaining a custom data source 
(forget about developing Spark itself): I would see an official announcement of 
the next minor version, and I would want to try it out quickly to check that my 
code still supports the new version. When I change the dependency version, 
everything breaks. My hopeful expectation would be an upgrade with no issues, 
but it turns out otherwise, and it even requires new learning (not just fixing 
compilation failures). That would make me give up supporting Spark 2.5, or at 
least I wouldn't follow up on such a change quickly. IMHO a 3.0-techpreview has 
the advantage here (assuming we provide Maven artifacts as well as an official 
announcement), as it sets the expectation that there are a bunch of changes, 
given that it's a new major version. It also provides plenty of time to try 
adopting it before the version is officially released.


On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue 
<rb...@netflix.com> wrote:
From those questions, I can see that there is significant confusion about what 
I'm proposing, so let me try to clear it up.

> 1. Is DSv2 stable in `master`?

DSv2 has reached a stable API that is capable of supporting all of the features 
we intend to deliver for Spark 3.0. The proposal is to backport the same API 
and features for Spark 2.5.

I am not saying that this API won't change after 3.0. Notably, Reynold wants to 
change the use of InternalRow. But these changes would come after 3.0 and don't 
affect the compatibility I'm proposing between the 2.5 and 3.0 releases. I 
also doubt that breaking changes would happen by 3.1.

> 2. If so, what subset of DSv2 patches is Ryan suggesting backporting?

I am proposing backporting what we intend to deliver for 3.0: the API currently 
in master, SQL support, and multi-catalog support.

> 3. How different would those backported DSv2 patches look in `branch-2.4`?

DSv2 is mostly an addition located in the `connector` package. It also changes 
some parts of the SQL parser and adds parsed plans, as well as new rules to 
convert from parsed plans. This is not an invasive change because we kept most 
of DSv2 separate. DSv2 should be nearly identical between the two branches.

> 4. What does he mean by `without breaking changes`? Is it technically feasible?

DSv2 is marked unstable in the 2.x line and changes between releases. The API 
changed between 

RE: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-13 Thread JOAQUIN GUANTER GONZALBEZ
Great! Please add joaquin.guantergonzal...@telefonica.com to the list of 
attendees.

Thanks,
Ximo

From: Ryan Blue
Sent: Monday, December 10, 2018 18:46
To: JOAQUIN GUANTER GONZALBEZ
CC: Wenchen Fan; Spark Dev List
Subject: Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save 
optional

Anyone can attend the v2 sync; you just need to let me know what email address 
you'd like to have added. Sorry it is invite-only. That's a limitation of the 
platform (Hangouts); the Spark community welcomes anyone who wants to 
participate.

On Mon, Dec 10, 2018 at 1:00 AM JOAQUIN GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com> wrote:
Ah, yes, you are right. The DataSourceV2 APIs wouldn’t let an implementor mark 
a Dataset as “bucketed”. Is there any documentation about the upcoming table 
support for data source v2, or any way of getting invited to the DataSourceV2 
community sync?

Thanks!
Ximo.

From: Wenchen Fan <cloud0...@gmail.com>
Sent: Wednesday, December 5, 2018 15:51
To: JOAQUIN GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com>
CC: Spark dev list <dev@spark.apache.org>
Subject: Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save 
optional

The bucket feature is designed to only work with data sources with table 
support, and currently the table support is not public yet, which means no 
external data sources can access bucketing information right now. The bucket 
feature only works with Spark native file source tables.

We are working on adding table support to data source v2, and we should have a 
good story about bucketing when it's done.

On Tue, Nov 27, 2018 at 1:01 AM JOAQUIN GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com> wrote:
Hello,

I have a proposal for a small improvement in the Datasource API and I’d like to 
know if it sounds like a change the Spark project would accept.

Currently, the `.save` method in DataFrameWriter will fail if the dataframe is 
bucketed and/or sorted. This makes sense, since there is no way of storing 
metadata in the current file-based data sources to know whether a file was 
bucketed or not.

I have a use case where I would like to implement a new, file-based data source 
which could keep track of that kind of metadata (without using the 
HiveMetastore), so I would like to be able to `.save` bucketed dataframes.

Would a patch to extend the datasource api with an indicator of whether that 
source is able to serialize bucketed dataframes be a welcome addition? I'm 
happy to work on it if that’s the case.
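
To make the idea concrete, here is a minimal sketch of the shape of the change 
I have in mind. The names (SupportsBucketedWrites, SketchWriter) are 
hypothetical stand-ins of mine, not existing Spark classes:

    // Hypothetical marker trait: a source mixing this in promises to
    // persist the bucketing metadata itself.
    trait SupportsBucketedWrites {
      def writeBucketSpec(numBuckets: Int, bucketColumns: Seq[String]): Unit
    }

    // Stand-in for DataFrameWriter's bucketing state and save path.
    class SketchWriter(source: AnyRef,
                       numBuckets: Option[Int],
                       bucketColumns: Seq[String]) {

      // Relaxed check: only fail when the target source cannot
      // serialize bucketing information.
      private def assertNotBucketed(operation: String): Unit =
        if (numBuckets.isDefined && !source.isInstanceOf[SupportsBucketedWrites]) {
          throw new IllegalArgumentException(
            s"'$operation' does not support bucketing for this data source")
        }

      def save(): Unit = {
        assertNotBucketed("save")
        source match {
          case s: SupportsBucketedWrites =>
            numBuckets.foreach(n => s.writeBucketSpec(n, bucketColumns))
          case _ => // plain save path, as today
        }
        // ... the actual write would happen here ...
      }
    }

With something along these lines, today’s file sources would keep failing 
exactly as they do now, while a source that opts in could accept bucketed 
dataframes.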

I have opened this as https://issues.apache.org/jira/browse/SPARK-26160 in the 
Spark Jira.

Cheers,
Ximo.




RE: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-10 Thread JOAQUIN GUANTER GONZALBEZ
Ah, yes, you are right. The DataSourceV2 APIs wouldn’t let an implementor mark 
a Dataset as “bucketed”. Is there any documentation about the upcoming table 
support for data source v2, or any way of getting invited to the DataSourceV2 
community sync?

Thanks!
Ximo.

From: Wenchen Fan
Sent: Wednesday, December 5, 2018 15:51
To: JOAQUIN GUANTER GONZALBEZ
CC: Spark dev list
Subject: Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save 
optional

The bucket feature is designed to only work with data sources with table 
support, and currently the table support is not public yet, which means no 
external data sources can access bucketing information right now. The bucket 
feature only works with Spark native file source tables.

We are working on adding table support to data source v2, and we should have a 
good story about bucketing when it's done.

On Tue, Nov 27, 2018 at 1:01 AM JOAQUIN GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com> wrote:
Hello,

I have a proposal for a small improvement in the Datasource API and I’d like to 
know if it sounds like a change the Spark project would accept.

Currently, the `.save` method in DataFrameWriter will fail if the dataframe is 
bucketed and/or sorted. This makes sense, since there is no way of storing 
metadata in the current file-based data sources to know whether a file was 
bucketed or not.

I have a use case where I would like to implement a new, file-based data source 
which could keep track of that kind of metadata (without using the 
HiveMetastore), so I would like to be able to `.save` bucketed dataframes.

Would a patch to extend the datasource api with an indicator of whether that 
source is able to serialize bucketed dataframes be a welcome addition? I'm 
happy to work on it if that’s the case.

I have opened this as https://issues.apache.org/jira/browse/SPARK-26160 in the 
Spark Jira.

Cheers,
Ximo.




[SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-11-26 Thread JOAQUIN GUANTER GONZALBEZ
Hello,

I have a proposal for a small improvement in the Datasource API and I'd like to 
know if it sounds like a change the Spark project would accept.

Currently, the `.save` method in DataFrameWriter will fail if the dataframe is 
bucketed and/or sorted. This makes sense, since there is no way of storing 
metadata in the current file-based data sources to know whether a file was 
bucketed or not.

I have a use case where I would like to implement a new, file-based data source 
which could keep track of that kind of metadata (without using the 
HiveMetastore), so I would like to be able to `.save` bucketed dataframes.

Would a patch to extend the datasource api with an indicator of whether that 
source is able to serialize bucketed dataframes be a welcome addition? I'm 
happy to work on it if that's the case.

I have opened this as https://issues.apache.org/jira/browse/SPARK-26160 in the 
Spark Jira.

Cheers,
Ximo.





RE: Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
Hi Daniel,

I am glad you already ran the numbers on this change ☺ (for anyone reading, 
they can be found on slide 19 in 
http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos). 
I haven’t done any formal benchmarking, but the speedup in our jobs is 
highly noticeable.

I agree it can be done without modifying Spark (we also have our own 
implementation in our codebase), but it seems a pity that anyone using the RDD 
API won’t get the benefits of having a sorted RDD (which happens quite often, 
since the shuffle phase can sort!).

Ximo.

From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com]
Sent: Monday, March 21, 2016 16:20
To: Ted Yu <yuzhih...@gmail.com>
CC: JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com>; 
dev@spark.apache.org
Subject: Re: Performance improvements for sorted RDDs

There is related discussion in 
https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to 
implement this without modifying Spark and we measured ~10x improvement over 
plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also 
realize this performance advantage.

On Mon, Mar 21, 2016 at 11:41 AM, Ted Yu 
<yuzhih...@gmail.com> wrote:
Do you have performance numbers to back up this proposal for the cogroup operation?

Thanks

On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ 
<joaquin.guantergonzal...@telefonica.com> wrote:
Hello devs,

I have found myself in a situation where Spark is doing sub-optimal 
computations for my RDDs, and I was wondering whether a patch to enable 
improved performance for this scenario would be a welcome addition to Spark or 
not.

The scenario happens when trying to cogroup two RDDs that are sorted by key and 
share the same partitioner. CoGroupedRDD will correctly detect that the RDDs 
have the same partitioner and will therefore create narrow cogroup split 
dependencies, as opposed to shuffle dependencies. This is great because it 
prevents any shuffling from happening. However, the cogroup is unable to detect 
that the RDDs are sorted in the same way, and will still insert all elements of 
the RDD in a map in order to join the elements with the same key.

When both RDDs are sorted using the same order, the cogroup can just join by 
doing a single pass over the data (since the data is ordered by key, you can 
just keep iterating until you find a different key). This would greatly reduce 
the memory requirements for these kinds of operations.

Adding this to Spark would require adding an “ordering” member to RDD of type 
Option[Ordering], similar to how the “partitioner” field works. That way, the 
sorting operations could populate this field and the operations that could 
benefit from this knowledge (cogroup, join, groupByKey, etc.) could read it to 
change their behavior accordingly.
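
To illustrate the single-pass join, here is a minimal standalone sketch of a 
merge-style cogroup over two key-sorted iterators (illustrative code of mine, 
not the actual CoGroupedRDD internals):

    import scala.collection.mutable.ArrayBuffer

    // Assumes both inputs are sorted by key under the same Ordering.
    def sortedCoGroup[K, V, W](left: Iterator[(K, V)], right: Iterator[(K, W)])
                              (implicit ord: Ordering[K]): Iterator[(K, (Seq[V], Seq[W]))] = {
      val l = left.buffered
      val r = right.buffered
      new Iterator[(K, (Seq[V], Seq[W]))] {
        def hasNext: Boolean = l.hasNext || r.hasNext
        def next(): (K, (Seq[V], Seq[W])) = {
          // The smallest key still pending on either side.
          val key =
            if (!l.hasNext) r.head._1
            else if (!r.hasNext) l.head._1
            else ord.min(l.head._1, r.head._1)
          // Drain the consecutive run of records with this key from each
          // side; no map over the whole partition is ever built.
          val lv = new ArrayBuffer[V]
          while (l.hasNext && ord.equiv(l.head._1, key)) lv += l.next()._2
          val rv = new ArrayBuffer[W]
          while (r.hasNext && ord.equiv(r.head._1, key)) rv += r.next()._2
          (key, (lv.toSeq, rv.toSeq))
        }
      }
    }

Per partition, this keeps only the values for the current key in memory, 
instead of a map over all elements.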

Do you think this would be a good addition to Spark?

Thanks,
Ximo




Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
Hello devs,

I have found myself in a situation where Spark is doing sub-optimal 
computations for my RDDs, and I was wondering whether a patch to enable 
improved performance for this scenario would be a welcome addition to Spark or 
not.

The scenario happens when trying to cogroup two RDDs that are sorted by key and 
share the same partitioner. CoGroupedRDD will correctly detect that the RDDs 
have the same partitioner and will therefore create narrow cogroup split 
dependencies, as opposed to shuffle dependencies. This is great because it 
prevents any shuffling from happening. However, the cogroup is unable to detect 
that the RDDs are sorted in the same way, and will still insert all elements of 
the RDD in a map in order to join the elements with the same key.

When both RDDs are sorted using the same order, the cogroup can just join by 
doing a single pass over the data (since the data is ordered by key, you can 
just keep iterating until you find a different key). This would greatly reduce 
the memory requirements for these kinds of operations.

Adding this to Spark would require adding an “ordering” member to RDD of type 
Option[Ordering], similar to how the “partitioner” field works. That way, the 
sorting operations could populate this field and the operations that could 
benefit from this knowledge (cogroup, join, groupByKey, etc.) could read it to 
change their behavior accordingly.

Do you think this would be a good addition to Spark?

Thanks,
Ximo


