Re: Apache Spark git repo moved to gitbox.apache.org
I filed a ticket: https://issues.apache.org/jira/browse/INFRA-17403 Please add your support there.

On Tue, Dec 11, 2018 at 4:58 PM, Sean Owen <sro...@apache.org> wrote:
> I asked on the original ticket at https://issues.apache.org/jira/browse/INFRA-17385 but no follow-up. Go ahead and open a new INFRA ticket.
>
> On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin <r...@databricks.com> wrote:
>> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I want to put some pressure myself there too.
>>
>> [...]
Re: Apache Spark git repo moved to gitbox.apache.org
I asked on the original ticket at https://issues.apache.org/jira/browse/INFRA-17385 but no follow-up. Go ahead and open a new INFRA ticket. On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin wrote: > Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I > want to put some pressure myself there too. > > > On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen wrote: > >> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra >> noise. >> >> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin >> wrote: >> >> Hmm, it also seems that github comments are being sync'ed to jira. That's >> gonna get old very quickly, we should probably ask infra to disable that >> (if we can't do it ourselves). >> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen wrote: >> >> Update for committers: now that my user ID is synced, I can successfully >> push to remote https://github.com/apache/spark directly. Use that as the >> 'apache' remote (if you like; gitbox also works). I confirmed the sync >> works both ways. >> >> As a bonus you can directly close pull requests when needed instead of >> using "Close Stale PRs" pull requests. >> >> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen wrote: >> >> Per the thread last week, the Apache Spark repos have migrated from >> https://git-wip-us.apache.org/repos/asf to >> https://gitbox.apache.org/repos/asf >> >> Non-committers: >> >> This just means repointing any references to the old repository to the >> new one. It won't affect you if you were already referencing >> https://github.com/apache/spark . >> >> Committers: >> >> Follow the steps at https://reference.apache.org/committer/github to >> fully sync your ASF and Github accounts, and then wait up to an hour for it >> to finish. >> >> Then repoint your git-wip-us remotes to gitbox in your git checkouts. For >> our standard setup that works with the merge script, that should be your >> 'apache' remote. 
>> For example here are my current remotes:
>>
>> $ git remote -v
>> apache        https://gitbox.apache.org/repos/asf/spark.git (fetch)
>> apache        https://gitbox.apache.org/repos/asf/spark.git (push)
>> apache-github git://github.com/apache/spark (fetch)
>> apache-github git://github.com/apache/spark (push)
>> origin        https://github.com/srowen/spark (fetch)
>> origin        https://github.com/srowen/spark (push)
>> upstream      https://github.com/apache/spark (fetch)
>> upstream      https://github.com/apache/spark (push)
>>
>> In theory we also have read/write access to github.com now too, but right now it hasn't yet worked for me. It may need to sync. This note just makes sure everyone knows how to keep pushing commits right now to the new ASF repo.
>>
>> Report any problems here!
>>
>> Sean
>>
>> --
>> Marcelo
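The repointing step Sean describes can be done with a single `git remote set-url`. The sketch below uses a throwaway repo so it is self-contained; in a real checkout only the `set-url` line is needed, and `apache` is the remote name from the standard merge-script setup above (adjust to your own remote name):

```shell
# Demo in a throwaway repo; in a real checkout, skip these two lines.
cd "$(mktemp -d)" && git init -q .
git remote add apache https://git-wip-us.apache.org/repos/asf/spark.git

# Repoint the 'apache' remote from the retired git-wip-us host to gitbox.
git remote set-url apache https://gitbox.apache.org/repos/asf/spark.git

# Both the fetch and push URLs should now reference gitbox:
git remote -v
```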
Re: Apache Spark git repo moved to gitbox.apache.org
Me too. I want to put in some input as well if that can be helpful.

On Wed, 12 Dec 2018, 8:20 am Reynold Xin wrote:
> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I want to put some pressure myself there too.
>
> [...]
Re: Apache Spark git repo moved to gitbox.apache.org
Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I want to put some pressure myself there too.

On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen <sro...@apache.org> wrote:
> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra noise.
>
> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>> Hmm, it also seems that github comments are being sync'ed to jira. That's gonna get old very quickly, we should probably ask infra to disable that (if we can't do it ourselves).
>>
>> [...]
Re: GitHub sync
Now, it's recovered. Dongjoon. On Tue, Dec 11, 2018 at 2:15 PM Dongjoon Hyun wrote: > https://issues.apache.org/jira/browse/INFRA-17401 is filed. > > Dongjoon. > > On Tue, Dec 11, 2018 at 12:49 PM Dongjoon Hyun > wrote: > >> Hi, All. >> >> Currently, GitHub `spark:branch-2.4` is out of sync (with two commits). >> >> >> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.4 >> https://github.com/apache/spark/commits/branch-2.4 >> >> I did the followings already. >> >>1. Wait for the next commit. >>2. Trigger resync at Apache Selfserv site >>3. Merge and push directly to GitHub `branch-2.4` (thanks to GitBox >> transition.) >> >> However, after syncing correctly with 3, the new patches are gone. >> Technically, GitHub `branch-2.4` seems to be force-pushed by some other >> entity. After more investigation, I'm going to file an INFRA issue for >> this. Please note this. >> >> Bests, >> Dongjoon. >> >>
Re: GitHub sync
https://issues.apache.org/jira/browse/INFRA-17401 is filed. Dongjoon. On Tue, Dec 11, 2018 at 12:49 PM Dongjoon Hyun wrote: > Hi, All. > > Currently, GitHub `spark:branch-2.4` is out of sync (with two commits). > > > https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.4 > https://github.com/apache/spark/commits/branch-2.4 > > I did the followings already. > >1. Wait for the next commit. >2. Trigger resync at Apache Selfserv site >3. Merge and push directly to GitHub `branch-2.4` (thanks to GitBox > transition.) > > However, after syncing correctly with 3, the new patches are gone. > Technically, GitHub `branch-2.4` seems to be force-pushed by some other > entity. After more investigation, I'm going to file an INFRA issue for > this. Please note this. > > Bests, > Dongjoon. > >
GitHub sync
Hi, All.

Currently, GitHub `spark:branch-2.4` is out of sync (by two commits).

https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.4
https://github.com/apache/spark/commits/branch-2.4

I have already done the following:

1. Waited for the next commit.
2. Triggered a resync at the Apache Selfserv site.
3. Merged and pushed directly to GitHub `branch-2.4` (thanks to the GitBox transition).

However, after syncing correctly with step 3, the new patches are gone: GitHub `branch-2.4` seems to have been force-pushed by some other entity. After more investigation, I'm going to file an INFRA issue for this. Please take note.

Bests,
Dongjoon.
Re: proposal for expanded & consistent timestamp types
Of course. I added some comments in the doc.

On Tue, Dec 11, 2018 at 12:01 PM Imran Rashid wrote:
> Hi Li,
>
> thanks for the comments! I admit I had not thought very much about python support; it's a good point. But I'd actually like to clarify one thing about the doc -- though it discusses java types, the point is actually about having support for these logical types at the SQL level. The doc uses java names instead of SQL names just because there is so much confusion around the SQL names, as they haven't been implemented consistently. Once there is support for the additional logical types, then we'd absolutely want to get the same support in python.
>
> It's great to hear there are existing python types we can map each behavior to. Could you add a comment on the doc on each of the types, mentioning the equivalent in python?
>
> thanks,
> Imran
>
> [...]
[Apache Beam] Custom DataSourceV2 instantiation: parameter passing and Encoders
Hi Spark guys,

I'm Etienne Chauchot and I'm a committer on the Apache Beam project. We have what we call runners: pieces of software that translate pipelines written with the Beam API into pipelines that use a native execution engine API. Currently, the Spark runner uses the old RDD / DStream APIs. I'm writing a new runner that will use structured streaming (but not continuous processing, and also no schema for now). I am just starting. I'm currently trying to map our sources to yours, targeting the new DataSourceV2 API. It maps pretty well to Beam sources, but I have a problem with instantiation of the custom source. I searched for an answer on Stack Overflow and the user ML with no luck; I guess it is too specific a question.

When visiting the Beam DAG I have access to Beam objects such as Source and Reader that I need to map to MicroBatchReader and InputPartitionReader. As far as I understand, a custom DataSourceV2 is instantiated automatically by Spark through sparkSession.readStream().format(providerClassName) or similar code. The problem is that I can only pass options of primitive types + String, so I cannot pass the Beam Source to the DataSourceV2.

=> Is there a way to do so?

Also, I get a Dataset<Row> as output. The Row contains an instance of Beam WindowedValue<T>, where T is the type parameter of the Source. I do a map on the Dataset to transform it into a Dataset<WindowedValue<T>>. I have a question related to the Encoder:

=> How do I properly create an Encoder for the generic type WindowedValue<T> to use in the map?

Here is the code: https://github.com/apache/beam/tree/spark-runner_structured-streaming

And more specifically:

https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/ReadSourceTranslatorBatch.java
https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/io/DatasetSource.java

Thanks,
Etienne
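One common workaround for the string-only options limitation is a driver-side static registry: keep the non-serializable object in a static map and pass only its lookup key through the options. The sketch below uses hypothetical names (`SourceRegistry`, `register`, `lookup`, the `beamSourceKey` option) -- this is not a Beam or Spark API, and it only works where the code creating the key and the code resolving it share a JVM:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: DataSourceV2 options only carry primitives and
// Strings, so instead of passing the Beam Source itself, register it in
// a static map and hand Spark just the lookup key, e.g.
//   sparkSession.readStream().format(providerClassName)
//               .option("beamSourceKey", SourceRegistry.register(source))
// The custom DataSourceV2 then calls SourceRegistry.lookup(key) when
// Spark instantiates it.
final class SourceRegistry {
  private static final Map<String, Object> SOURCES = new ConcurrentHashMap<>();

  // Called on the driver before building the read; returns the key to
  // pass as a String option.
  static String register(Object source) {
    String key = UUID.randomUUID().toString();
    SOURCES.put(key, source);
    return key;
  }

  // Called inside the custom DataSourceV2; returns null for unknown keys.
  @SuppressWarnings("unchecked")
  static <T> T lookup(String key) {
    return (T) SOURCES.get(key);
  }
}
```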
Re: Self join
I don’t know your exact underlying business problem, but maybe a graph solution such as Spark GraphX meets your requirements better. Usually self-joins are done to address some kind of graph problem (even if you would not describe it as such), and graph engines are much more efficient for these kinds of problems.

> On 11.12.2018 at 12:44, Marco Gaido wrote:
>
> Hi all,
>
> I'd like to bring to the attention of more people a problem which has been there for a long time, i.e., self joins. Currently, we have many troubles with them. This has been reported several times to the community and seems to affect many people, but as of now no solution has been accepted for it.
>
> I created a PR some time ago in order to address the problem (https://github.com/apache/spark/pull/21449), but Wenchen mentioned he tried to fix this problem too, and so far no attempt was successful because there is no clear semantic (https://github.com/apache/spark/pull/21449#issuecomment-393554552).
>
> So I'd like to propose to discuss here which is the best approach for tackling this issue, which I think would be great to fix for 3.0.0, so if we decide to introduce breaking changes in the design, we can do that.
>
> Thoughts on this?
>
> Thanks,
> Marco
Re: proposal for expanded & consistent timestamp types
Hi Li,

thanks for the comments! I admit I had not thought very much about python support; it's a good point. But I'd actually like to clarify one thing about the doc -- though it discusses java types, the point is actually about having support for these logical types at the SQL level. The doc uses java names instead of SQL names just because there is so much confusion around the SQL names, as they haven't been implemented consistently. Once there is support for the additional logical types, then we'd absolutely want to get the same support in python.

It's great to hear there are existing python types we can map each behavior to. Could you add a comment on the doc on each of the types, mentioning the equivalent in python?

thanks,
Imran

On Fri, Dec 7, 2018 at 1:33 PM Li Jin wrote:
> Imran,
>
> Thanks for sharing this. When working on interop between Spark and Pandas/Arrow in the past, we also faced some issues due to the different definitions of timestamp in Spark and Pandas/Arrow, because Spark timestamp has Instant semantics and Pandas/Arrow timestamp has either LocalDateTime or OffsetDateTime semantics. (Detailed discussion is in the PR: https://github.com/apache/spark/pull/18664#issuecomment-316554156.)
>
> For one I am excited to see this effort going, but I would also love to see Python interop included/considered in the picture. I don't think it adds much to what has already been proposed, because Python timestamps are basically LocalDateTime or OffsetDateTime.
>
> Li
>
> On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid wrote:
>
>> Hi,
>>
>> I'd like to discuss the future of timestamp support in Spark, in particular with respect to handling timezones in different SQL types. In a nutshell:
>>
>> * There are at least 3 different ways of handling the timestamp type across timezone changes.
>> * We'd like Spark to clearly distinguish the 3 types (it currently implements 1 of them), in a way that is backwards compatible and also compliant with the SQL standard.
>> * We'll get agreement across Spark, Hive, and Impala.
>>
>> Zoltan Ivanfi (Parquet PMC, also my coworker) has written up a detailed doc describing the problem in more detail, the state of various SQL engines, and how we can get to a better state without breaking any current use cases. The proposal is good for Spark by itself. We're also going to the Hive & Impala communities with this proposal, as it's better for everyone if everything is compatible.
>>
>> Note that this isn't proposing a specific implementation in Spark as yet, just a description of the overall problem and our end goal. We're going to each community to get agreement on the overall direction. Then each community can figure out specifics as they see fit. (I don't think there are any technical hurdles with this approach, e.g. to decide whether this would be even possible in Spark.)
>>
>> Here's a link to the doc Zoltan has put together. It is a bit long, but it explains how such a seemingly simple concept has become such a mess and how we can get to a better state.
>>
>> https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.dq3b1mwkrfky
>>
>> Please review the proposal and let us know your opinions, concerns and suggestions.
>>
>> thanks,
>> Imran
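For concreteness, the three behaviors discussed in this thread map directly onto the `java.time` types whose names the proposal doc borrows. This is a standalone illustration, independent of Spark:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampSemantics {
  public static void main(String[] args) {
    // Instant semantics (per Li Jin above, what Spark's timestamp has
    // today): a fixed point on the global timeline; the displayed
    // wall-clock time shifts with the session time zone.
    Instant instant = Instant.parse("2018-12-11T00:00:00Z");
    LocalDateTime utcView = LocalDateTime.ofInstant(instant, ZoneOffset.UTC);
    LocalDateTime laView = LocalDateTime.ofInstant(instant, ZoneId.of("America/Los_Angeles"));
    System.out.println(utcView); // 2018-12-11T00:00
    System.out.println(laView);  // 2018-12-10T16:00 -- same instant, different wall clock

    // LocalDateTime semantics: bare wall-clock fields with no zone at
    // all; the value never shifts, whatever zone the reader is in.
    LocalDateTime local = LocalDateTime.of(2018, 12, 11, 0, 0);

    // OffsetDateTime semantics: wall-clock fields plus an explicit
    // fixed offset from UTC, pinning the value to one instant.
    OffsetDateTime withOffset = local.atOffset(ZoneOffset.ofHours(-8));
    System.out.println(withOffset.toInstant()); // 2018-12-11T08:00:00Z
  }
}
```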
Re: Self join
Marco, Thanks for starting the discussion! I think it would be great to have a clear description of the problem and a proposed solution. Do you have anything like that? It would help bring the rest of us up to speed without reading different pull requests. Thanks! rb On Tue, Dec 11, 2018 at 3:54 AM Marco Gaido wrote: > Hi all, > > I'd like to bring to the attention of a more people a problem which has > been there for long, ie, self joins. Currently, we have many troubles with > them. This has been reported several times to the community and seems to > affect many people, but as of now no solution has been accepted for it. > > I created a PR some time ago in order to address the problem ( > https://github.com/apache/spark/pull/21449), but Wenchen mentioned he > tried to fix this problem too but so far no attempt was successful because > there is no clear semantic ( > https://github.com/apache/spark/pull/21449#issuecomment-393554552). > > So I'd like to propose to discuss here which is the best approach for > tackling this issue, which I think would be great to fix for 3.0.0, so if > we decide to introduce breaking changes in the design, we can do that. > > Thoughts on this? > > Thanks, > Marco > -- Ryan Blue Software Engineer Netflix
Re: Pushdown in DataSourceV2 question
In v2, it is up to the data source to tell Spark that a pushed filter is satisfied, by returning the pushed filters that Spark should run. You can indicate that a filter is handled by the source by not returning it to Spark. You can also show that a filter is used by the source by showing it in the output for the plan node, which I think is the `description` method in the latest set of changes.

If you want to check with an external source to see what can be pushed down, then you can do that at any time in your source implementation.

On Tue, Dec 11, 2018 at 3:46 AM Noritaka Sekiyama wrote:
> Hi,
> Thank you for responding to this thread. I'm really interested in this discussion.
>
> My original idea might be the same as what Alessandro said: introducing a mechanism by which Spark can communicate with the DataSource and get metadata showing whether pushdown is supported or not. I'm wondering whether it would be that expensive or not.
>
> On Mon, Dec 10, 2018 at 20:12, Alessandro Solimando wrote:
>> I think you are generally right, but there are so many different scenarios that it might not always be the best option; consider for instance a "fast" network in between a single data source and "Spark", lots of data, and an "expensive" (low-selectivity) expression, as Wenchen suggested.
>>
>> In such a case it looks to me that you end up "re-scanning" the whole dataset just to make sure the filter has been applied, where having such info as metadata or via a communication protocol with the data source (if supported) would be cheaper.
>>
>> If there is no support at all for such a mechanism, I think it could be worth exploring the idea a bit more. However, supporting such a mechanism would require some development effort for each datasource to support it (e.g., asking the datasource for the physical plan applied at query time, and the ability to parse it to extract relevant info and act on it), as I am not aware of any general interface for exchanging such information.
>>
>> On Sun, 9 Dec 2018 at 15:34, Jörn Franke wrote:
>>> It is not about lying or not, or trust or not. Some or all filters may not be supported by a data source. Some might only be applied under certain environmental conditions (e.g. enough memory).
>>>
>>> It is much more expensive to communicate between Spark and a data source which filters have been applied or not than just checking it as Spark does. Especially if you have several different data sources at the same time (joins etc.).
>>>
>>> On 09.12.2018 at 14:30, Wenchen Fan wrote:
>>> Expressions/functions can be expensive, and I do think Spark should trust the data source and not re-apply pushed filters. If a data source lies, many things can go wrong...
>>>
>>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote:
>>> Well, even if it has to apply it again, if pushdown is activated then it is much less costly for Spark to see whether the filter has been applied or not. Applying the filter is negligible; what pushdown really avoids, if the file format implements it, is IO cost (for reading) as well as the cost of converting from the file format's internal datatype to Spark's. Those two things are very expensive, but not the filter check. In the end, there could also be data-source-internal reasons not to apply a filter (there can be many, depending on your scenario, the format, etc.). Instead of "discussing" between Spark and the data source, it is much less costly for Spark to check that the filters are consistently applied.
>>>
>>> On 09.12.2018 at 12:39, Alessandro Solimando wrote:
>>> Hello, that's an interesting question, but after Frank's reply I am a bit puzzled. If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query? Consider a filter pushed down to the data source: either Spark has to know whether it has been applied or not, or it has to re-apply the filter anyway (and pay the price for that). Is there any other option I am not considering?
>>>
>>> Best regards,
>>> Alessandro
>>>
>>> On Sat, 8 Dec 2018 at 12:32, Jörn Franke wrote:
>>> BTW. Even for json a pushdown can make sense, to avoid data unnecessarily ending up in Spark (because it would cause unnecessary overhead). In the datasource v2 api you need to implement SupportsPushDownFilters
>>>
>>> On 08.12.2018 at 10:50, Noritaka Sekiyama wrote:
>>> Hi,
>>>
>>> I'm a support engineer, interested in DataSourceV2.
>>>
>>> Recently I had some pain troubleshooting to check whether pushdown is actually applied or not. I noticed that DataFrame's explain() method shows pushdown even for JSON. It totally depends on the DataSource side, I believe. However, I would
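The v2 contract Ryan describes can be sketched with stand-in types so it runs standalone. In Spark 2.4 the real interface is `SupportsPushDownFilters`, where `pushFilters` returns the filters Spark must still evaluate after scanning and `pushedFilters` reports what the source accepted; the classes below merely model that handshake and are not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.spark.sql.sources.Filter -- NOT the real
// class, just enough structure to model the handshake.
class Filter {
  final String column;
  final String op;
  final Object value;

  Filter(String column, String op, Object value) {
    this.column = column;
    this.op = op;
    this.value = value;
  }
}

// Minimal model of the DataSourceV2 pushdown contract: the reader keeps
// the filters it can evaluate and returns the rest, so Spark re-applies
// only the leftovers instead of re-checking everything.
class PushdownModel {
  private final List<Filter> pushed = new ArrayList<>();

  // Analogue of SupportsPushDownFilters#pushFilters: accept what the
  // source can handle (here: only equality, as an example policy) and
  // return the remainder for Spark to evaluate after scanning.
  List<Filter> pushFilters(List<Filter> filters) {
    List<Filter> leftover = new ArrayList<>();
    for (Filter f : filters) {
      if ("=".equals(f.op)) {
        pushed.add(f);
      } else {
        leftover.add(f);
      }
    }
    return leftover;
  }

  // Analogue of pushedFilters(): what explain() can report as pushed.
  List<Filter> pushedFilters() { return pushed; }
}
```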
Self join
Hi all,
I'd like to bring to the attention of more people a problem which has been around for a long time, i.e., self joins. Currently we have many troubles with them. This has been reported to the community several times and seems to affect many people, but as of now no solution has been accepted. I created a PR some time ago to address the problem (https://github.com/apache/spark/pull/21449), but Wenchen mentioned he had tried to fix this problem too, and so far no attempt was successful because there is no clear semantic (https://github.com/apache/spark/pull/21449#issuecomment-393554552).

So I'd like to propose that we discuss here which is the best approach for tackling this issue, which I think would be great to fix for 3.0.0, so that if we decide to introduce breaking changes in the design, we can do that.

Thoughts on this?

Thanks,
Marco
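To make the ambiguity concrete for readers new to this thread: Spark resolves column references by internal expression ID, and both sides of df.join(df, ...) (or a join against a derived df.filter(...)) carry the same IDs, so a condition like df("id") === df2("id") cannot tell the two sides apart. The following is a hypothetical pure-Python toy of that failure mode and of the usual alias workaround; none of these classes exist in Spark.

```python
# Toy illustration (NOT Spark code) of why self-joins are ambiguous:
# column references resolve by expression ID, and a derived DataFrame
# keeps the same IDs as its parent.
import itertools

_ids = itertools.count()

class Column:
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_ids)   # unique ID assigned at creation

class DataFrame:
    def __init__(self, columns):
        self.columns = columns
    def filter_copy(self):
        # A derived DataFrame (think df.filter(...)) reuses the parent's
        # Column objects, i.e. the same expression IDs -- the root cause.
        return DataFrame(self.columns)
    def alias(self):
        # Aliasing re-creates columns with fresh IDs: the usual workaround.
        return DataFrame([Column(c.name) for c in self.columns])

df = DataFrame([Column("id")])
df2 = df.filter_copy()

# A join condition df["id"] == df2["id"] compares expression IDs, so it
# degenerates to comparing an ID with itself:
ambiguous = df.columns[0].expr_id == df2.columns[0].expr_id   # True
resolved = df.columns[0].expr_id == df.alias().columns[0].expr_id  # False
```

Any accepted fix presumably has to decide, with a clear semantic, when two references with the same ID should be treated as distinct, which is exactly the difficulty Wenchen pointed out.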
Re: Pushdown in DataSourceV2 question
Hi,
Thank you for responding to this thread. I'm really interested in this discussion. My original idea might be the same as what Alessandro said: introducing a mechanism by which Spark can communicate with the DataSource and obtain metadata showing whether pushdown is supported or not. I'm wondering whether that would really be so expensive.

On Mon, Dec 10, 2018 at 20:12, Alessandro Solimando wrote:
> I think you are generally right, but there are so many different scenarios that it might not always be the best option. Consider for instance a "fast" network between a single data source and Spark, lots of data, and an "expensive" (low-selectivity) expression, as Wenchen suggested.
>
> In such a case it looks to me that you end up "re-scanning" the whole dataset just to make sure the filter has been applied, whereas having such info as metadata, or via a communication protocol with the data source (if supported), would be cheaper.
>
> If there is no support at all for such a mechanism, I think it could be worth exploring the idea a bit more. However, supporting such a mechanism would require some development effort for each data source (e.g., asking the data source for the physical plan applied at query time, the ability to parse it to extract the relevant info and act on it), as I am not aware of any general interface for exchanging such information.
>
> On Sun, 9 Dec 2018 at 15:34, Jörn Franke wrote:
>> It is not about lying or not, or trust or not. Some or all filters may not be supported by a data source. Some might only be applied under certain environmental conditions (e.g. enough memory).
>>
>> It is much more expensive to communicate between Spark and a data source about which filters have been applied than to just check them as Spark does, especially if you have several different data sources at the same time (joins etc.).
>>
>> On Dec 9, 2018 at 14:30, Wenchen Fan wrote:
>>> Expressions/functions can be expensive, and I do think Spark should trust the data source and not re-apply pushed filters. If a data source lies, many things can go wrong...
>>>
>>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote:
>>>> Well, even if Spark has to apply a filter again, with pushdown activated that costs much less than checking whether the filter has been applied. Applying the filter is negligible; what pushdown really avoids, if the file format implements it, is the IO cost of reading as well as the cost of converting from the file format's internal datatypes to Spark's. Those two things are very expensive, but not the filter check. In the end, there can also be data-source-internal reasons not to apply a filter (many, depending on your scenario, the format, etc.). Instead of "discussing" between Spark and the data source, it is much less costly for Spark to check that the filters are consistently applied.
>>>>
>>>> On Dec 9, 2018 at 12:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>>>>> Hello, that's an interesting question, but after Jörn's reply I am a bit puzzled.
>>>>>
>>>>> If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query? Consider a filter pushed down to the data source: either Spark has to know whether it has been applied, or it has to re-apply the filter anyway (and pay the price for that). Is there any other option I am not considering?
>>>>>
>>>>> Best regards,
>>>>> Alessandro
>>>>>
>>>>> On Sat, Dec 8, 2018 at 12:32, Jörn Franke wrote:
>>>>>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (because it would cause unnecessary overhead). In the DataSource V2 API you need to implement SupportsPushDownFilter.
>>>>>>
>>>>>> On Dec 8, 2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>> I'm a support engineer, interested in DataSourceV2.
>>>>>>>
>>>>>>> Recently I had some pain troubleshooting whether pushdown is actually applied or not. I noticed that DataFrame's explain() method shows pushdown even for JSON. It totally depends on the DataSource side, I believe. However, I would like Spark to have some way to confirm whether a specific pushdown is actually applied in the DataSource or not.
>>>>>>>
>>>>>>> # Example
>>>>>>> val df = spark.read.json("s3://sample_bucket/people.json")
>>>>>>> df.printSchema()
>>>>>>> df.filter($"age" > 20).explain()
>>>>>>>
>>>>>>> root
>>>>>>>  |-- age: long (nullable = true)
>>>>>>>  |-- name: string (nullable = true)
>>>>>>>
>>>>>>> == Physical Plan ==
>>>>>>> *Project [age#47L, name#48]
>>>>>>> +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>>>>>>>    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, Location: InMemoryFileIndex[s3://sample_bucket/people.json], PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)]
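Until a proper metadata mechanism exists, one pragmatic debugging aid is to scrape the PushedFilters list out of the plan string, as in the explain() output quoted above. A hedged Python sketch follows; note that the explain() text format is not a stable API, and, as this thread makes clear, a filter appearing in PushedFilters is not guaranteed to have actually been applied by the source (the JSON reader being the example at hand).

```python
# Extract the PushedFilters entries from a physical-plan string.
# Debugging aid only: explain() output is not a stable API, and a listed
# filter may still have been ignored by the data source.
import re

def pushed_filters(plan_text):
    m = re.search(r"PushedFilters: \[([^\]]*)\]", plan_text)
    if not m or not m.group(1).strip():
        return []
    # Spark prints entries as "IsNotNull(age), GreaterThan(age,20)":
    # commas *inside* a filter have no trailing space, so ", " splits safely.
    return m.group(1).split(", ")

plan = """== Physical Plan ==
*Project [age#47L, name#48]
+- *Filter (isnotnull(age#47L) && (age#47L > 20))
   +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)]"""

print(pushed_filters(plan))  # ['IsNotNull(age)', 'GreaterThan(age,20)']
```

In a Spark session one would feed this the string form of the plan (e.g. df.queryExecution's explain text) instead of a literal; the point is only that the information is already printed, just not exposed programmatically.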