RE: [DISCUSS] Spark 2.5 release
I’ll chime in as an actual implementor of a custom DataSource who is keeping an eye on the 3.0 DSv2 changes.

We started implementing DSv2 in the 2.4 branch, but quickly discovered that the DSv2 in 3.0 was a complete breaking change (to the point where it could have been named DSv3 and it wouldn’t have come as a surprise). Since the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided to fall back to DSv1 in order to ease the future transition to Spark 3.

From my point of view, a Spark 2.5 release with a backport of DSv2 _which does not remove the old 2.4 DSv2 classes_ would be ideal, since it would work as a stepping stone for both the current users of DSv1 and the 2.4 DSv2 classes.

I agree with Xiao that it is likely that the 3.0 DSv2 classes will need to incorporate feedback from the community once people start using them. I hope we aren’t planning on marking them as Stable as soon as Spark 3.0 is released! They don’t seem to have any InterfaceStability marker at the moment in master.

Cheers,
Ximo

From: Ryan Blue
Sent: Wednesday, September 25, 2019 0:54
To: Jungtaek Lim
CC: Dongjoon Hyun; Holden Karau; Hyukjin Kwon; Marco Gaido; Matei Zaharia; Reynold Xin; Spark Dev List
Subject: Re: [DISCUSS] Spark 2.5 release

> That's not a new requirement, that's an "implicit" requirement via semantic
> versioning.

The expectation is that the DSv2 API will change in minor versions in the 2.x line. The API is marked with the Experimental API annotation to signal that it can change, and it has been changing.

A requirement to not change this API for a 2.5 release is a new requirement. I'm fine with that if that's what everyone wants. Like I said, if we want to add a requirement to not change this API then we shouldn't release the 2.5 that I'm proposing.

On Tue, Sep 24, 2019 at 2:51 PM Jungtaek Lim <kabh...@gmail.com> wrote:

>> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
> This has not been a requirement for DSv2 development so far. If this is a new
> requirement, then we should not do a 2.5 release.

My 2 cents: the target version of the new DSv2 has only ever been 3.0, so we never had a chance to think about such a requirement. That's why there's no restriction on breaking compatibility in the codebase. That's not a new requirement, that's an "implicit" requirement via semantic versioning.

I agree that some of the APIs have changed between Spark 2.x versions, but I guess the changes in the "new" DSv2 would be bigger than the sum of the changes to the "old" DSv2, which were introduced across multiple minor versions.

Suppose we're developers in the Spark ecosystem maintaining a custom data source (forget about developing Spark): I would get some official announcement of the next minor version, and I would want to try it out quickly to see whether my stuff still supports the new version. When I change the dependency version, everything will break. My hopeful expectation would be no issues while upgrading, but it turns out that's not the case, and it even requires new learning (not only fixing compilation failures). It would just make me give up on supporting Spark 2.5, or at least I wouldn't follow up on such a change quickly.

IMHO 3.0-techpreview has an advantage here (assuming we provide maven artifacts as well as an official announcement), as it sets the expectation that there are a bunch of changes, given that it's a new major version. It also provides a bunch of time to try adopting it before the version is officially released.

On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue <rb...@netflix.com> wrote:

From those questions, I can see that there is significant confusion about what I'm proposing, so let me try to clear it up.

> 1. Is DSv2 stable in `master`?

DSv2 has reached a stable API that is capable of supporting all of the features we intend to deliver for Spark 3.0. The proposal is to backport the same API and features for Spark 2.5.

I am not saying that this API won't change after 3.0. Notably, Reynold wants to change the use of InternalRow. But these changes are after 3.0 and don't affect the compatibility I'm proposing, between the 2.5 and 3.0 releases. I also doubt that breaking changes would happen by 3.1.

> 2. If then, what subset of DSv2 patches does Ryan is suggesting backporting?

I am proposing backporting what we intend to deliver for 3.0: the API currently in master, SQL support, and multi-catalog support.

> 3. How much those backporting DSv2 patches looks differently in `branch-2.4`?

DSv2 is mostly an addition located in the `connector` package. It also changes some parts of the SQL parser and adds parsed plans, as well as new rules to convert from parsed plans. This is not an invasive change because we kept most of DSv2 separate. DSv2 should be nearly identical between the two branches.

> 4. What does he mean by `without breaking changes`? Is it technically feasible?

DSv2 is marked unstable in the 2.x line and changes between releases. The API changed between
RE: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional
Great! Please add joaquin.guantergonzal...@telefonica.com to the list of attendees.

Thanks,
Ximo

From: Ryan Blue
Sent: Monday, December 10, 2018 18:46
To: JOAQUIN GUANTER GONZALBEZ
CC: Wenchen Fan; Spark Dev List
Subject: Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

Anyone can attend the v2 sync. You just need to let me know what email address you'd like to have added. Sorry it is invite-only. That's a limitation of the platform (hangouts); the Spark community welcomes anyone that wants to participate.

On Mon, Dec 10, 2018 at 1:00 AM JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote:

Ah, yes, you are right. The DataSourceV2 APIs wouldn’t let an implementor mark a Dataset as “bucketed”.

Is there any documentation about the upcoming table support for data source v2 or any way of getting invited to the DataSourceV2 community sync?

Thanks!
Ximo.

From: Wenchen Fan <cloud0...@gmail.com>
Sent: Wednesday, December 5, 2018 15:51
To: JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com>
CC: Spark dev list <dev@spark.apache.org>
Subject: Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

The bucket feature is designed to only work with data sources with table support, and currently the table support is not public yet, which means no external data sources can access bucketing information right now. The bucket feature only works with Spark native file source tables.

We are working on adding table support to data source v2, and we should have a good story about bucketing when it's done.

On Tue, Nov 27, 2018 at 1:01 AM JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote:

Hello, I have a proposal for a small improvement in the Datasource API and I’d like to know if it sounds like a change the Spark project would accept.
Currently, the `.save` method in DataFrameWriter will fail if the dataframe is bucketed and/or sorted. This makes sense, since there is no way of storing metadata in the current file-based data sources to know whether a file was bucketed or not. I have a use case where I would like to implement a new, file-based data source which could keep track of that kind of metadata (without using the HiveMetastore), so I would like to be able to `.save` bucketed dataframes. Would a patch to extend the datasource api with an indicator of whether that source is able to serialize bucketed dataframes be a welcome addition? I'm happy to work on it if that’s the case. I have opened this as https://issues.apache.org/jira/browse/SPARK-26160 in the Spark Jira. Cheers, Ximo. Este mensaje y sus adjuntos se dirigen exclusivamente a su destinatario, puede contener información privilegiada o confidencial y es para uso exclusivo de la persona o entidad de destino. Si no es usted. el destinatario indicado, queda notificado de que la lectura, utilización, divulgación y/o copia sin autorización puede estar prohibida en virtud de la legislación vigente. Si ha recibido este mensaje por error, le rogamos que nos lo comunique inmediatamente por esta misma vía y proceda a su destrucción. The information contained in this transmission is privileged and confidential information intended only for the use of the individual or entity named above. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this transmission in error, do not read it. Please immediately reply to the sender that you have received this communication in error and then delete it. Esta mensagem e seus anexos se dirigem exclusivamente ao seu destinatário, pode conter informação privilegiada ou confidencial e é para uso exclusivo da pessoa ou entidade de destino. 
Se não é vossa senhoria o destinatário indicado, fica notificado de que a leitura, utilização, divulgação e/ou cópia sem autorização pode estar proibida em virtude da legislação vigente. Se recebeu esta mensagem por erro, rogamos-lhe que nos o comunique imediatamente por esta mesma via e proceda a sua destruição
RE: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional
Ah, yes, you are right. The DataSourceV2 APIs wouldn’t let an implementor mark a Dataset as “bucketed”.

Is there any documentation about the upcoming table support for data source v2 or any way of getting invited to the DataSourceV2 community sync?

Thanks!
Ximo.

From: Wenchen Fan
Sent: Wednesday, December 5, 2018 15:51
To: JOAQUIN GUANTER GONZALBEZ
CC: Spark dev list
Subject: Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

The bucket feature is designed to only work with data sources with table support, and currently the table support is not public yet, which means no external data sources can access bucketing information right now. The bucket feature only works with Spark native file source tables.

We are working on adding table support to data source v2, and we should have a good story about bucketing when it's done.

On Tue, Nov 27, 2018 at 1:01 AM JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote:

Hello, I have a proposal for a small improvement in the Datasource API and I’d like to know if it sounds like a change the Spark project would accept.

Currently, the `.save` method in DataFrameWriter will fail if the dataframe is bucketed and/or sorted. This makes sense, since there is no way of storing metadata in the current file-based data sources to know whether a file was bucketed or not.

I have a use case where I would like to implement a new, file-based data source which could keep track of that kind of metadata (without using the HiveMetastore), so I would like to be able to `.save` bucketed dataframes. Would a patch to extend the datasource api with an indicator of whether that source is able to serialize bucketed dataframes be a welcome addition? I'm happy to work on it if that’s the case.

I have opened this as https://issues.apache.org/jira/browse/SPARK-26160 in the Spark Jira.

Cheers,
Ximo.
[SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional
Hello, I have a proposal for a small improvement in the Datasource API and I'd like to know if it sounds like a change the Spark project would accept.

Currently, the `.save` method in DataFrameWriter will fail if the dataframe is bucketed and/or sorted. This makes sense, since there is no way of storing metadata in the current file-based data sources to know whether a file was bucketed or not.

I have a use case where I would like to implement a new, file-based data source which could keep track of that kind of metadata (without using the HiveMetastore), so I would like to be able to `.save` bucketed dataframes. Would a patch to extend the datasource api with an indicator of whether that source is able to serialize bucketed dataframes be a welcome addition? I'm happy to work on it if that's the case.

I have opened this as https://issues.apache.org/jira/browse/SPARK-26160 in the Spark Jira.

Cheers,
Ximo.
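For readers following the thread, here is a minimal sketch of the kind of opt-in indicator the proposal describes, written in plain Scala with no Spark on the classpath. Every name in it (`SupportsBucketedWrite`, `DataSource`, `BucketSpec`, `checkSave`, and the two example sources) is hypothetical and invented for illustration; it is not the Spark API, and it is not necessarily the shape that the change tracked in SPARK-26160 would take.

```scala
object BucketedSaveSketch {
  // Hypothetical marker: a source mixes this in to declare that it can
  // persist bucketing/sorting metadata by itself (no Hive metastore needed).
  trait SupportsBucketedWrite

  // Hypothetical, heavily simplified stand-in for a data source.
  trait DataSource { def name: String }

  // Simplified stand-in for the bucketing information attached to a write.
  final case class BucketSpec(
      numBuckets: Int,
      bucketColumns: Seq[String],
      sortColumns: Seq[String] = Nil)

  // The check a `save` call could run: a bucketed write is rejected only
  // when the target source has not opted in through the marker trait.
  def checkSave(
      source: DataSource,
      bucketing: Option[BucketSpec]): Either[String, Unit] =
    (source, bucketing) match {
      case (_, None)                           => Right(())
      case (_: SupportsBucketedWrite, Some(_)) => Right(())
      case (s, Some(_)) =>
        Left(s"'save' does not support bucketing for source '${s.name}'")
    }

  // Two made-up sources: one that opts in and one that does not.
  object PlainFileSource extends DataSource { val name = "plain-files" }
  object TrackingFileSource extends DataSource with SupportsBucketedWrite {
    val name = "tracking-files"
  }
}
```

With this shape, existing sources keep today's behavior (a bucketed `.save` is refused), and only a source that explicitly mixes in the marker accepts a bucketed write.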
RE: Performance improvements for sorted RDDs
Hi Daniel,

I am glad you already ran the numbers on this change ☺ (for anyone reading, they can be found on slide 19 in http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos ).

I haven’t done any formal benchmarking, but the speedup in our jobs is highly noticeable. I agree it can be done without modifying Spark (we also have our own implementation in my codebase), but it seems a pity that anyone using the RDD API won’t get the benefits of having a sorted RDD (which happens quite often, since the shuffle phase can sort!).

Ximo.

From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com]
Sent: Monday, March 21, 2016 16:20
To: Ted Yu <yuzhih...@gmail.com>
CC: JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com>; dev@spark.apache.org
Subject: Re: Performance improvements for sorted RDDs

There is related discussion in https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to implement this without modifying Spark and we measured ~10x improvement over plain RDD joins.

I haven't benchmarked against DataFrames -- maybe they also realize this performance advantage.

On Mon, Mar 21, 2016 at 11:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Do you have performance numbers to back up this proposal for the cogroup operation?

Thanks

On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote:

Hello devs,

I have found myself in a situation where Spark is doing sub-optimal computations for my RDDs, and I was wondering whether a patch to enable improved performance for this scenario would be a welcome addition to Spark or not.

The scenario happens when trying to cogroup two RDDs that are sorted by key and share the same partitioner. CoGroupedRDD will correctly detect that the RDDs have the same partitioner and will therefore create narrow cogroup split dependencies, as opposed to shuffle dependencies. This is great because it prevents any shuffling from happening. However, the cogroup is unable to detect that the RDDs are sorted in the same way, and will still insert all elements of the RDD in a map in order to join the elements with the same key.

When both RDDs are sorted using the same order, the cogroup can just join by doing a single pass over the data (since the data is ordered by key, you can just keep iterating until you find a different key). This would greatly reduce the memory requirements for this kind of operation.

Adding this to Spark would require adding an “ordering” member to RDD of type Option[Ordering], similarly to how the “partitioner” field works. That way, the sorting operations could populate this field and the operations that could benefit from this knowledge (cogroup, join, groupbykey, etc.) could read it to change their behavior accordingly.

Do you think this would be a good addition to Spark?

Thanks,
Ximo
Performance improvements for sorted RDDs
Hello devs,

I have found myself in a situation where Spark is doing sub-optimal computations for my RDDs, and I was wondering whether a patch to enable improved performance for this scenario would be a welcome addition to Spark or not.

The scenario happens when trying to cogroup two RDDs that are sorted by key and share the same partitioner. CoGroupedRDD will correctly detect that the RDDs have the same partitioner and will therefore create narrow cogroup split dependencies, as opposed to shuffle dependencies. This is great because it prevents any shuffling from happening. However, the cogroup is unable to detect that the RDDs are sorted in the same way, and will still insert all elements of the RDD in a map in order to join the elements with the same key.

When both RDDs are sorted using the same order, the cogroup can just join by doing a single pass over the data (since the data is ordered by key, you can just keep iterating until you find a different key). This would greatly reduce the memory requirements for this kind of operation.

Adding this to Spark would require adding an “ordering” member to RDD of type Option[Ordering], similarly to how the “partitioner” field works. That way, the sorting operations could populate this field and the operations that could benefit from this knowledge (cogroup, join, groupbykey, etc.) could read it to change their behavior accordingly.

Do you think this would be a good addition to Spark?

Thanks,
Ximo
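The single-pass idea in this message can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `KeyedPartition`, `hashCogroup`, `mergeCogroup` and `cogroup` are hypothetical names, the `ordering` field stands in for the proposed `Option[Ordering]` member, two orderings are treated as "the same" when they compare equal with `==`, and none of this is how CoGroupedRDD is actually written.

```scala
object SortedCogroupSketch {
  // Hypothetical stand-in for one partition of a keyed RDD. `ordering` plays
  // the role of the proposed member: Some(ord) means "sorted by key with ord".
  final case class KeyedPartition[K, V](
      records: Seq[(K, V)],
      ordering: Option[Ordering[K]] = None)

  // Map-based cogroup: every record of both sides is buffered in memory.
  def hashCogroup[K, A, B](
      left: Seq[(K, A)],
      right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).iterator.map { k =>
      (k, (l.getOrElse(k, Nil).map(_._2), r.getOrElse(k, Nil).map(_._2)))
    }.toMap
  }

  // Merge-based cogroup: a single pass over two key-sorted iterators that
  // only ever holds the values of the current key.
  def mergeCogroup[K, A, B](left: Iterator[(K, A)], right: Iterator[(K, B)])(
      implicit ord: Ordering[K]): Iterator[(K, (Seq[A], Seq[B]))] = {
    val l = left.buffered
    val r = right.buffered

    // Consume the run of consecutive records whose key equals `key`.
    def takeRun[V](it: BufferedIterator[(K, V)], key: K): Seq[V] = {
      val run = Vector.newBuilder[V]
      while (it.hasNext && ord.equiv(it.head._1, key)) run += it.next()._2
      run.result()
    }

    new Iterator[(K, (Seq[A], Seq[B]))] {
      def hasNext: Boolean = l.hasNext || r.hasNext
      def next(): (K, (Seq[A], Seq[B])) = {
        // The next output key is the smaller of the two heads.
        val key =
          if (!r.hasNext) l.head._1
          else if (!l.hasNext) r.head._1
          else ord.min(l.head._1, r.head._1)
        (key, (takeRun(l, key), takeRun(r, key)))
      }
    }
  }

  // Dispatch on the ordering metadata, the way the proposal suggests
  // operations could consult the new member.
  def cogroup[K, A, B](
      left: KeyedPartition[K, A],
      right: KeyedPartition[K, B]): List[(K, (Seq[A], Seq[B]))] =
    (left.ordering, right.ordering) match {
      case (Some(lo), Some(ro)) if lo == ro =>
        mergeCogroup(left.records.iterator, right.records.iterator)(lo).toList
      case _ =>
        hashCogroup(left.records, right.records).toList
    }
}
```

When both inputs carry the same ordering, the merge path emits keys in sorted order and never materializes more than one key's values at a time; otherwise the sketch falls back to the map-based path, which corresponds to the behavior the message describes for today's cogroup.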