Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Nicholas Chammas
This is a side issue, but I’d like to bring people’s attention to SPARK-28024. Cases 2, 3, and 4 described in that ticket are still problems today on master (I just rechecked) even with ANSI mode enabled. Well, maybe not problems, but I’m flagging this since Spark’s behavior differs in these
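For a concrete sense of the difference ANSI mode makes, here is a minimal sketch (assuming a pyspark shell; `spark.sql.ansi.enabled` is the controlling config, and the exact error class may vary by version):
```
# Legacy mode: invalid casts silently produce NULL.
spark.conf.set("spark.sql.ansi.enabled", False)
spark.sql("SELECT CAST('abc' AS INT)").show()   # shows NULL

# ANSI mode: the same cast fails at runtime (CAST_INVALID_INPUT).
spark.conf.set("spark.sql.ansi.enabled", True)
spark.sql("SELECT CAST('abc' AS INT)").show()
```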

Re: Generating config docs automatically

2024-02-22 Thread Nicholas Chammas
dea); but that’s just my > opinion. I'd be happy to help with reviews though. > > On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: >> I know config documentation is not the most exciting thing. If there is >> anything I can d

Re: Generating config docs automatically

2024-02-21 Thread Nicholas Chammas
I know config documentation is not the most exciting thing. If there is anything I can do to make this as easy as possible for a committer to shepherd, I’m all ears! > On Feb 14, 2024, at 8:53 PM, Nicholas Chammas wrote: > > I’m interested in automating our config documentat

Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a committer who is interested in shepherding this work. We have around 60 tables of configs across our documentation. Here’s a typical example.
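One possible starting point for such automation, sketched here purely as an illustration (not the actual proposal): `SET -v` already exposes every non-internal SQL config with its default and description.
```
# Sketch: list documented SQL configs with their defaults and descriptions.
for row in spark.sql("SET -v").collect()[:5]:
    print(row["key"], "|", row["value"], "|", row["meaning"])
```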

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
aFrame. That helped me to find similar issues in > most cases. > > HTH > > On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: >> Consider this example: >> >>> from pyspark.sql.functions import sum

How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example:
>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+
I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate. I see that
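A sketch of one way to inspect what gets generated (assuming Spark 3.0+, where explain supports a codegen mode):
```
from pyspark.sql.functions import sum

# Prints the whole-stage-codegen Java source that executes this plan.
spark.range(4).repartition(2).select(sum("id")).explain(mode="codegen")
```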

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
etch to > leave it lying around until Spark 5. > It's not exactly unused though. > > I would not object to removing it unless there is some voice of support here. > > On Sat, Jan 20, 2024 at 10:38 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
From the dev thread: What else could be removed in Spark 4? > On Aug 17, 2023, at 1:44 AM, Yang Jie wrote: > > I would like to know how we should handle the two Kinesis-related modules in > Spark 4.0. They have a very low

Install Ruby 3 to build the docs

2024-01-10 Thread Nicholas Chammas
Just a quick heads up that, while Ruby 2.7 will continue to work, you should plan to install Ruby 3 in the near future in order to build the docs. (I recommend using rbenv to manage multiple Ruby versions.) Ruby 2 reached EOL in March 2023

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation to the user list and BCC-ing the dev list. Also, this statement: > We are not validating against table or column existence. …is not correct. When you call spark.sql(…), Spark will look up the table references and
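Since analysis is eager when the DataFrame is created, that lookup can be sketched like this (table name hypothetical):
```
from pyspark.sql.utils import AnalysisException

try:
    # Table and column resolution happens here, before any action runs.
    spark.sql("SELECT some_col FROM some_nonexistent_table")
except AnalysisException as e:
    print("Query failed validation:", e)
```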

Guidance for filling out "Affects Version" on Jira

2023-12-17 Thread Nicholas Chammas
The Contributing guide only mentions what to fill in for “Affects Version” for bugs. How about for improvements? This question once caused some problems when I set “Affects Version” to the last released version, and that was interpreted as a request

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas <nicholas.cham...@gmail.com>

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh wrote: > spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, > or NONE to disable it completely. > Hmm, I’ve also never heard of this setting before and can’t seem to find it in the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh wrote: > > By default, the CBO is enabled in Spark. Note that this is not correct. AQE is enabled
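The defaults are easy to confirm from a shell; a minimal sketch (the config names are standard, the values shown are recent defaults):
```
# CBO is off by default; AQE is on by default (since Spark 3.2).
print(spark.conf.get("spark.sql.cbo.enabled"))      # 'false'
print(spark.conf.get("spark.sql.adaptive.enabled")) # 'true'
```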

Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
explain(mode="cost")), what the cost-based optimizer does and how to enable it. Would this be a welcome addition to the project’s documentation? I’m happy to work on this. > On Dec 5, 2023, at 12:12 PM, Nicholas Chammas wrote: > > I’m interested in improving some of t
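For reference, a sketch of the workflow such documentation would cover (table name hypothetical):
```
# Collect table- and column-level statistics into the metastore, then
# inspect what the optimizer sees via the cost explain mode.
spark.sql("ANALYZE TABLE some_table COMPUTE STATISTICS FOR ALL COLUMNS")
spark.table("some_table").explain(mode="cost")  # prints sizeInBytes / rowCount
```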

Re: Algolia search on website is broken

2023-12-10 Thread Nicholas Chammas
onsole. > On Dec 5, 2023, at 11:28 AM, Nicholas Chammas wrote: > > Should I report this instead on Jira? Apologies if the dev list is not the > right place. > > Search on the website appears to be broken. For example, here is a search for > “analyze”:

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
This is not a question for the dev list. Moving dev to bcc. One thing I would try is to connect to this database using JDBC + SSH tunnel, but without Spark. That way you can focus on getting the JDBC connection to work without Spark complicating the picture for you. > On Dec 5, 2023, at 8:12 

When and how does Spark use metastore statistics?

2023-12-05 Thread Nicholas Chammas
I’m interested in improving some of the documentation relating to the table and column statistics that get stored in the metastore, and how Spark uses them. But I’m not clear on a few things, so I’m writing to you with some questions. 1. The documentation for 

Algolia search on website is broken

2023-12-05 Thread Nicholas Chammas
Should I report this instead on Jira? Apologies if the dev list is not the right place. Search on the website appears to be broken. For example, here is a search for “analyze”: And here is the same search using DDG

Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Nicholas Chammas
I’ve always considered DataFrames to be logically equivalent to SQL tables or queries. In SQL, the result order of any query is implementation-dependent without an explicit ORDER BY clause. Technically, you could run `SELECT * FROM table;` 10 times in a row and get 10 different orderings. I

Allowing all Reader or Writer settings to be provided as options

2022-08-09 Thread Nicholas Chammas
Hello people, I want to bring some attention to SPARK-39630 and ask if there are any design objections to the idea proposed there. The gist of the proposal is that there are some reader or writer directives that cannot be supplied as
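To illustrate the kind of gap being described (examples mine, assuming df is an existing DataFrame; these are not necessarily the cases in the ticket): some directives are only reachable as builder methods, not as string options.
```
# Expressible as a string option today:
df.write.format("parquet").option("compression", "snappy").save("/tmp/out")

# Only expressible as a dedicated builder method, not as an option:
df.write.format("parquet").partitionBy("year", "month").save("/tmp/out2")
```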

Re: Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
rmal Github emails - that is if we turn them off do we have anything? > > On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > I assume I’m not the only one getting these new emails from GitBox. Is there > a story behind that that I misse

Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I assume I’m not the only one getting these new emails from GitBox. Is there a story behind that that I missed? I’d rather not get these emails on the dev list. I assume most of the list would agree with me. GitHub has a good set of options for following activity on the repo. People who want

Re: [DISCUSS] Rename 'SQL' to 'SQL / DataFrame', and 'Query' to 'Execution' in SQL UI page

2022-03-28 Thread Nicholas Chammas
+1 Understanding the close relationship between SQL and DataFrames in Spark was a key learning moment for me, but I agree that using the terms interchangeably can be confusing. > On Mar 27, 2022, at 9:27 PM, Hyukjin Kwon wrote: > > *for some reason, the image looks broken (to me). I am

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
er way of computing aggregations through > composition of other Expressions. > > Simeon > > On Thu, Dec 9, 2021 at 9:26 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > >> I'm trying to create a new aggregate function. It's my first time

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > >> No takers here? :) >> >> I can see now why a median function is not available in most data >> processing systems. It's pretty annoying to i

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :) I can see now why a median function is not available in most data processing systems. It's pretty annoying to implement! On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas wrote: > I'm trying to create a new aggregate function. It's my first time working > with Cataly

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working with Catalyst, so it's exciting---but I'm also in a bit over my head. My goal is to create a function to calculate the median. As a very simple solution, I could just
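For readers hitting this thread later: a memory-bounded approximation has been available since Spark 3.1, and an exact median function eventually landed in Spark 3.4. A minimal sketch of the former:
```
from pyspark.sql import functions as F

df = spark.range(100).withColumnRenamed("id", "value")

# Approximate median with bounded memory (Spark 3.1+).
df.select(F.percentile_approx("value", 0.5)).show()
```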

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nicholas Chammas
Farewell to Jenkins and its classic weather forecast build status icons: [image: health-80plus.png][image: health-60to79.png][image: health-40to59.png][image: health-20to39.png][image: health-00to19.png] And thank you Shane for all the help over these years. Will you be nuking all the

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Nicholas Chammas
Side note about time travel: There is a PR to add VERSION/TIMESTAMP AS OF syntax to Spark SQL. On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue wrote: > I want to note that I wouldn't recommend time traveling this way by using > the hint for `snapshot-id`.
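That syntax did land later (Spark 3.3, for v2 tables that support snapshots); a sketch with a hypothetical catalog table:
```
# Query a table as of a specific snapshot version or timestamp.
spark.sql("SELECT * FROM cat.db.events VERSION AS OF 42").show()
spark.sql("SELECT * FROM cat.db.events TIMESTAMP AS OF '2021-11-01'").show()
```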

Jira components cleanup

2021-11-15 Thread Nicholas Chammas
https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page I think the "docs" component should be merged into "Documentation". Likewise, the "k8" component should be merged into "Kubernetes". I think anyone can technically update tags, but

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon wrote: > I am currently thinking we will have to convert the Koalas tests to use > unittests to match with PySpark for now. > Keep in mind that pytest supports unittest-based tests out of the box, so
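That is, a plain unittest file runs unchanged under pytest; a minimal sketch:
```
import unittest

class ExampleTest(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

# Both `python -m unittest` and plain `pytest` discover and run this class.
if __name__ == "__main__":
    unittest.main()
```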

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin wrote: > I don't think we should deprecate existing APIs. > +1 I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way. For the large

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
te a > reference within a scope which is closed. For example within the body of a > function (without return value) and store it only in a local > variable. After the scope is closed in case of our function when the caller > gets the control back you have chance to see the co

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
of an > unexpected error (in this case you should keep the checkpoint data). > > This way even after an unexpected exit the next run of the same app should > be able to pick up the checkpointed data. > > Best Regards, > Attila > > > > > On Wed, Mar 10, 2021 at 8:

Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Hello people, I'm working on a fix for SPARK-33000. Spark does not clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs are set. In the course of developing a fix, another contributor pointed out
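For context, a sketch of the setup in question (the config name is real; the point of SPARK-33000 is that cleanup does not happen on shutdown even with it set):
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .getOrCreate())
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# Eagerly materializes checkpoint files under /tmp/checkpoints; on shutdown
# they can be left behind, which is what SPARK-33000 is about.
df = spark.range(10).checkpoint()
```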

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 10:34 AM Sean Owen wrote: > There is no way to force people to review or commit something of course. > And keep in mind we get a lot of, shall we say, unuseful pull requests. > There is occasionally some blowback to closing someone's PR, so the path of > least resistance

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 9:58 AM Enrico Minack wrote: > *What is the approved way to ...* > > *... prevent it from being auto-closed?* Committing and commenting to the > PR does not prevent it from being closed the next day. > Committing and commenting should prevent the PR from being closed. It

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Nicholas Chammas
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen wrote: > It isn't that regexp_extract_all (for example) is useless outside SQL, > just, where do you draw the line? Supporting 10s of random SQL functions > across 3 other languages has a cost, which has to be weighed against > benefit, which we can

Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-10-25 Thread Nicholas Chammas
Just want to call out that this SPIP should probably account somehow for PySpark and the work being done in SPARK-32082 to improve PySpark exceptions. On Sun, Oct 25, 2020 at 8:05 PM Xinyi Yu wrote: > Hi all, > > We like to post a SPIP of

Re: get method guid prefix for file parts for write

2020-09-25 Thread Nicholas Chammas
I think what George is looking for is a way to determine ahead of time the partition IDs that Spark will use when writing output. George, I believe this is an example of what you're looking for:
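The linked example is truncated above; a sketch of the general idea (my illustration, not necessarily the one linked): spark_partition_id reports each row's partition, which corresponds to the NNNNN in the part-NNNNN-<uuid> output file names.
```
from pyspark.sql.functions import spark_partition_id

df = spark.range(8).repartition(2)

# The 'pid' column shows which output partition (and thus which
# part-NNNNN file) each row would land in on write.
df.withColumn("pid", spark_partition_id()).show()
```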

PySpark: Un-deprecating inferring DataFrame schema from list of dictionaries

2020-08-24 Thread Nicholas Chammas
https://github.com/apache/spark/pull/29510 I don't think this is a big deal, but since we're removing a deprecation that has been around for ~6 years, I figured it would be good to bring everyone's attention to this change. Hopefully, we are not breaking any hidden assumptions about the
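For reference, a minimal example of the now-undeprecated path:
```
# Schema (names and types) is inferred from the dictionary keys and values.
df = spark.createDataFrame([{"name": "alice", "age": 1}])
df.printSchema()
```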

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```
I agree that Hadoop 3 would

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development, and we regularly access S3 directly from our laptops/workstations. One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is being able to use a recent version of hadoop-aws that has mature support for s3a.

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Nicholas Chammas
I believe that was fixed in 3.0 and there was a decision not to backport the fix: SPARK-31170. On Wed, Jun 3, 2020 at 1:04 PM Xiao Li wrote: > Just downloaded it on my local MacBook. Trying to create a table using the > pre-built PySpark. It

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Nicholas Chammas
wrote: > Maybe it's time to switch. Do you know if we can still link the JIRA > against GitHub? > The script used to change the status of JIRA too, but it stopped working > a long time ago - I suspect this isn't a big deal. > > On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-24 Thread Nicholas Chammas
Have we asked Infra recently about enabling the native Jira-GitHub integration? Maybe we can deprecate the part of this script that updates Jira tickets with links to the PR and rely on the native

Beginner PR against the Catalog API

2020-04-02 Thread Nicholas Chammas
I recently submitted my first Scala PR. It's very simple, though I don't know if I've done things correctly since I'm not a regular Scala user. SPARK-31000: Add ability to set table description in the catalog

Re: Automatic PR labeling

2020-04-02 Thread Nicholas Chammas
SPARK-31330 <https://issues.apache.org/jira/browse/SPARK-31330>: Automatically label PRs based on the paths they touch. On Wed, Apr 1, 2020 at 11:34 PM Hyukjin Kwon wrote: > @Nicholas Chammas Would you be interested in > taking a look? I would love this to be done.

Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Nicholas Chammas
Probably the discussion here about Improvement Jira tickets and the "Affects Version" field: https://github.com/apache/spark/pull/27534#issuecomment-588416416 On Wed, Apr 1, 2020 at 9:59 PM Hyukjin Kwon wrote: > > 2) check with older versions to fill up affects version for bug > I don't agree

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Nicholas Chammas
I don't have a dog in this race, but: Would it be OK to ship 3.0 with some release notes and/or prominent documentation calling out this issue, and then fixing it in 3.0.1? On Sat, Mar 28, 2020 at 8:45 PM Jungtaek Lim wrote: > I'd say SPARK-31257 as open blocker, because the change in upcoming

Automatic PR labeling

2020-03-24 Thread Nicholas Chammas
Public Service Announcement: There is a GitHub action that lets you automatically label PRs based on what paths they modify. https://github.com/actions/labeler If we set this up, perhaps down the line we can update the PR dashboard and PR merge script to use the tags. cc @Dongjoon Hyun, who

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-20 Thread Nicholas Chammas
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > 2. PARTITIONED BY colTypeList: I think we can support it in the unified > syntax. Just make sure it doesn't appear together with PARTITIONED BY > transformList. > Another side note: Perhaps as part of (or after) unifying the CREATE TABLE

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Nicholas Chammas
Side comment: The current docs for CREATE TABLE add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither

Re-triggering failed GitHub workflows

2020-03-16 Thread Nicholas Chammas
Is there any way contributors can retrigger a failed GitHub workflow, like we do with Jenkins? There's supposed to be a "Re-run all checks" button, but I don't see it. Do we need INFRA to grant permissions for that, perhaps? Right now I'm doing it by adding empty commits:
```
git commit --allow-empty -m "..."
```

Re: Running Spark through a debugger

2020-03-12 Thread Nicholas Chammas
What are you having trouble with? Does it build? > > On Mon, Dec 16, 2019 at 11:27 PM Nicholas Chammas wrote: > > I normally stick to the Python parts of Spark, but I am interested in > walking through the DSv2 code and understanding how it works. I tried > fol

Re: Auto-linking from PRs to Jira tickets

2020-03-10 Thread Nicholas Chammas
Could you point us to the ticket? I'd like to follow along. On Tue, Mar 10, 2020 at 9:13 AM Alex Ott wrote: > For Zeppelin I've created recently the ASF INFRA Jira for that feature... > Although maybe it should be done for all projects. > > Nicholas Chammas at "Mon, 9 Mar 20

Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
> On Mon, Mar 9, 2020 at 2:14 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > >> This is a feature of GitHub itself and would auto-link directly from the >> PR back to Jira. >> >> I haven't looked at the PR dashboard in a while, but I believe

Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
that runs the PR dashboard, > is it no longer working? > > On Mon, Mar 9, 2020 at 12:28 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > >> https://github.blog/2019-10-14-introducing-autolink-references/ >> >> GitHub has a feature for auto-li

Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
https://github.blog/2019-10-14-introducing-autolink-references/ GitHub has a feature for auto-linking from PRs to external tickets. It's only available for their paid plans, but perhaps Apache has some arrangement with them where we can get that feature. Since we include Jira ticket numbers in

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Nicholas Chammas
+1 to what Sean said. On Fri, Feb 21, 2020 at 10:14 AM Sean Owen wrote: > We've avoided using Assignee because it implies that someone 'owns' > resolving the issue, when we want to keep it collaborative, and many > times in the past someone would ask to be assigned and then didn't > follow

Re: More publicly documenting the options under spark.sql.*

2020-01-27 Thread Nicholas Chammas
; experimental option that may change, or legacy, or safety valve flag. >>>>> Certainly anything that's marked an internal conf. (That does raise >>>>> the question of who it's for, if you have to read source to find it.) >>>>> >>>>> I don't

Re: Closing stale PRs with a GitHub Action

2020-01-27 Thread Nicholas Chammas
Apr+label%3AStale+is%3Aclosed> is how many PRs are active with relatively recent activity. It's a testament to how active this project is. On Sun, Dec 15, 2019 at 11:16 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Just an FYI to everyone, we’ve merged in an Action to

Re: More publicly documenting the options under spark.sql.*

2020-01-15 Thread Nicholas Chammas
e >>> the question of who it's for, if you have to read source to find it.) >>> >>> I don't know if we need to overhaul the conf system, but there may >>> indeed be some confs that could legitimately be documented. I don't >>> know which. >>> >

More publicly documenting the options under spark.sql.*

2020-01-14 Thread Nicholas Chammas
I filed SPARK-30510 thinking that we had forgotten to document an option, but it turns out that there's a whole bunch of stuff under SQLConf.scala

Running Spark through a debugger

2019-12-16 Thread Nicholas Chammas
I normally stick to the Python parts of Spark, but I am interested in walking through the DSv2 code and understanding how it works. I tried following the "IDE Setup" section of the developer tools page, but quickly hit several problems loading the

Re: Closing stale PRs with a GitHub Action

2019-12-15 Thread Nicholas Chammas
time is long and it posts >>> some friendly message about reopening if there is a material change in the >>> proposed PR, the problem, or interest in merging it. >>> >>> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas <nicholas.cham...@gmail.com

R linter is broken

2019-12-13 Thread Nicholas Chammas
The R linter GitHub action seems to be busted. Looks like we need to update some repository references

Re: Closing stale PRs with a GitHub Action

2019-12-06 Thread Nicholas Chammas
It's standard practice and doesn't mean it can't be > reopened. > Often the related JIRA should be closed as well, but we have done that > separately with bulk-close in the past. > > On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

Closing stale PRs with a GitHub Action

2019-12-05 Thread Nicholas Chammas
It’s that topic again. We have almost 500 open PRs. A good chunk of them are more than a year old. The oldest open PR dates to summer 2015. https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc GitHub has an Action for closing stale PRs.

Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
Hyukjin Kwon wrote: > I think it's broken... cc Josh Rosen > > On Wed, Dec 4, 2019 at 10:25 AM, Nicholas Chammas wrote: >> We used to have a bot or something that automatically linked Jira tickets >> to PRs that mentioned them in their title. I don't see that happening >

Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
We used to have a bot or something that automatically linked Jira tickets to PRs that mentioned them in their title. I don't see that happening anymore. Did we intentionally remove this functionality, or is it temporarily broken for some reason?

Re: Can't build unidoc

2019-11-29 Thread Nicholas Chammas
at 11:48 AM Nicholas Chammas wrote: > > Howdy folks. Running `./build/sbt unidoc` on the latest master is giving > me this trace: > > ``` > > [warn] :: > > [warn] ::

Can't build unidoc

2019-11-29 Thread Nicholas Chammas
Howdy folks. Running `./build/sbt unidoc` on the latest master is giving me this trace:
```
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::          UNRESOLVED DEPENDENCIES          ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::
```

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Nicholas Chammas
> I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile. What do you mean by "only meaningful under the hadoop-3.2 profile"? On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian wrote: > Hey Steve, > > In terms of

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-16 Thread Nicholas Chammas
> Data Source API with Catalog Supports
Where can we read more about this? The linked Nabble thread doesn't mention the word "Catalog". On Thu, Nov 7, 2019 at 5:53 PM Xingbo Jiang wrote: > Hi all, > > To enable wide-scale community testing of the upcoming Spark 3.0 release, > the Apache Spark

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-03 Thread Nicholas Chammas
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran wrote: > It would be really good if the spark distributions shipped with later > versions of the hadoop artifacts. > I second this. If we need to keep a Hadoop 2.x profile around, why not make it Hadoop 2.8 or something newer? Koert Kuipers wrote:

Spark 3.0 and S3A

2019-10-28 Thread Nicholas Chammas
Howdy folks, I have a question about what is happening with the 3.0 release in relation to Hadoop and hadoop-aws. Today, among other builds, we release a build of Spark built against Hadoop 2.7 and another one built

Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Nicholas Chammas
On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > >> A quick question about failure modes, as a casual observer of the DSv2 >> effort: >> >> I was considering filing a JIRA ticket about enhancing the >> Dat

Re: DSv2 sync - 4 September 2019

2019-09-08 Thread Nicholas Chammas
A quick question about failure modes, as a casual observer of the DSv2 effort: I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure *reason* in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a
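A sketch of the current behavior being discussed (file path hypothetical): the raw bad line is captured, but not the reason it failed.
```
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),  # receives the raw bad line
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/tmp/input.csv"))
```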

Providing a namespace for third-party configurations

2019-08-30 Thread Nicholas Chammas
I discovered today that EMR provides its own optimizations for Spark. Some of these optimizations are controlled by configuration settings with names like `spark.sql.dynamicPartitionPruning.enabled` or

Re: Recognizing non-code contributions

2019-08-05 Thread Nicholas Chammas
On Mon, Aug 5, 2019 at 9:55 AM Sean Owen wrote: > On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz wrote: > > So... events coordinators? I'd still make them committers. I guess I'm > still struggling to understand what problem making people VIP's without > giving them committership is trying to

Python API for mapGroupsWithState

2019-08-02 Thread Nicholas Chammas
Can someone succinctly describe the challenge in adding the `mapGroupsWithState()` API to PySpark? I was hoping for some suboptimal but nonetheless working solution to be available in Python, as there is with Python UDFs for example, but that doesn't seem to be the case. The JIRA ticket for

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Nicholas Chammas
This kind of question is for the User list, or for something like Stack Overflow. It's not on topic here. The dev list (i.e. this list) is for discussions about the development of Spark itself. On Wed, May 15, 2019 at 1:50 PM Chetan Khatri wrote: > Any one help me, I am confused. :( > > On

Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Nicholas Chammas
I think the annotations are compatible with Python 2 since Maciej implemented them via stub files, which Python 2 simply ignores. Folks using mypy to check types will get the benefit whether they're on Python 2 or 3,
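That is, the types live in a parallel `.pyi` file that only type checkers read; a hand-written sketch (illustrative, not the actual pyspark-stubs content):
```
# dataframe.pyi -- the runtime dataframe.py is untouched; mypy reads this.
from typing import List

class DataFrame:
    columns: List[str]
    def count(self) -> int: ...
    def distinct(self) -> "DataFrame": ...
```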

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Nicholas Chammas
ively mature for core ETL and >> incremental processing purpose. I interact with a lot of users using it >> everyday. We can always expand the use cases and add more, but that also >> adds maintenance burden. In any case, it'd be good to get some activity >> here. >>

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Nicholas Chammas
As an observer, this thread is interesting and concerning. Is there an emerging consensus that Structured Streaming is somehow not relevant anymore? Or is it just that folks consider it "complete enough"? Structured Streaming was billed as the replacement to DStreams. If committers, generally

Re: Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
it should only send one email when a PR is merged. > > On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > >> Can we somehow disable these new email alerts coming through for the >> Spark website repo? >> >> On Wed, D

Noisy spark-website notifications

2018-12-19 Thread Nicholas Chammas
Can we somehow disable these new email alerts coming through for the Spark website repo? On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: > ueshin commented on a change in pull request #163: Announce the schedule > of 2019 Spark+AI summit at SF > URL: >

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
On Tue, 23 Oct 2018 at 21:32, Sean Owen wrote: > >> The comments say that it is not possible to overload 'and' and 'or', >> which would have been more natural. >> > Yes, unfortunately, Python does not allow you to override `and`, `or`, or `not`. They are not implemented as “dunder” methods (e.g.
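Concretely, in a pyspark shell (a minimal sketch): `&` dispatches to Column.__and__, while `and` forces bool() on a Column, which PySpark rejects.
```
from pyspark.sql.functions import col

df = spark.range(10)

df.filter((col("id") > 2) & (col("id") < 7)).show()  # works, via __and__

# df.filter((col("id") > 2) and (col("id") < 7))     # raises: Python calls
#                                                    # bool() on the Column
```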

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
, 2018 at 3:02 PM Nicholas Chammas wrote: > So it appears then that the equivalent operators for PySpark are > completely missing from the docs, right? That’s surprising. And if there > are column function equivalents for |, &, and ~, then I can’t find those > either for PySpark.

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
la/index.html#org.apache.spark.sql.Column > > On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: >> I can’t seem to find any documentation of the &, |, and ~ operators for >> PySpark DataFrame columns. I assume that should be in our

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
> https://spark.apache.org/docs/2.3.0/api/sql/index.html > > On Tue, Oct 23, 2018 at 10:27 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: >> I can’t seem to find any documentation of the &, |, and ~ operators for >> PySpark DataFram

Documentation of boolean column operators missing?

2018-10-23 Thread Nicholas Chammas
I can’t seem to find any documentation of the &, |, and ~ operators for PySpark DataFrame columns. I assume that should be in our docs somewhere. Was it always missing? Am I just missing something obvious? Nick

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Nicholas Chammas
FYI I believe we have an open correctness issue here: https://issues.apache.org/jira/browse/SPARK-25150. However, it needs review by another person to confirm whether it is indeed a correctness issue (and whether it still impacts this latest RC). Nick On Wed, Oct 10, 2018 at 3:14 PM, Jean Georges

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Nicholas Chammas
I believe -1 votes are merited only for correctness bugs and regressions since the previous release. Does SPARK-23200 count as either? On Mon, Sep 17, 2018 at 9:40 AM, Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote: > -1 > > I would like to see:

Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Nicholas Chammas
As Reynold pointed out, we don't have to drop Python 2 support right off the bat. We can just deprecate it with Spark 3.0, which would allow us to actually drop it at a later 3.x release. On Sat, Sep 15, 2018 at 2:09 PM Erik Erlandson wrote: > On a separate dev@spark thread, I raised a question

Re: Python friendly API for Spark 3.0

2018-09-14 Thread Nicholas Chammas
Do we need to ditch Python 2 support to provide type hints? I don’t think so. Python lets you specify typing stubs that provide the same benefit without forcing Python 3. On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote: > > > On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson wrote: >> To be clear,

Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-29 Thread Nicholas Chammas
Dunno if I made a silly mistake, but I wanted to bring some attention to this issue in case there was something serious going on here that might affect the upcoming release. https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25150 Nick

Re: Review notification bot

2018-07-22 Thread Nicholas Chammas
e worth mentioning nonetheless. On Sat, Jul 14, 2018 at 11:17 AM Holden Karau wrote: > Hearing no objections (and in a shout out to @Nicholas Chammas who > initially suggested mention-bot back in 2016) I've set up a copy of mention > bot and run it against my own repo (looks like
