Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
@gatorsmile The last build was killed by SIGKILL. Can you start a new
build, please?
---
-
To unsubscribe, e-mail: reviews
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
@gatorsmile I've addressed many of your points in today's commits. Can you
please take a look at what I've done so far? I'm still working on the PRs you
requested
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/19410
Hi @szhem. Thanks for the kind reminder and thanks for your contribution.
I'm sorry I did not respond sooner.
I no longer work where I regularly used the checkpointing code with large
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
@gatorsmile I've removed the changes to the files as you requested. This
removes support for schema pruning on filters of queries. I've pushed the
previous revision to a new branch in our `spark
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r199648692
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
---
@@ -182,18 +182,20 @@ private[parquet
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/19410
Hi @szhem. I dug deeper and think I understand the problem better.
To state the obvious, the periodic checkpointer deletes checkpoint files of
RDDs that are potentially still accessible
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r199631341
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
---
@@ -47,16 +47,25 @@ import
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r199643803
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
---
@@ -71,9 +80,22 @@ private[parquet
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
@DaimonPl I'm going to resolve the merge conflicts shortly. Otherwise, I
have no intention of making further modifications to this PR outside of further
review
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
@viirya I've rebased to resolve conflicts. All tests are passing. Can you
take another look and sign off?
Cheers
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
I'd just suggest trying it. Since this PR is a patch for master, please
message me personally at m...@allman.ms to discuss progress and questions
on a backport to 2.2. If we get it working
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> However, I am -1 on merging a change this large after branch cut.
It's disappointing, but I agree we can't merge a change this large into a
branch cut. It will have to wait for 2.
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
@mallman Yes, the issue with window functions is reproducible even with
this PR.
Can you attach a (small) parquet file I can use to test this scenario
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @gatorsmile @HyukjinKwon @ajacques I'm seeing incorrect results from the
following simple projection query, given @jainaks test file:
>
>```
>select page.url, page fr
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> These test failures are in Spark streaming. Is this just an intermittent
test failure or actually caused by this PR?
I was able to run the first failing test successfully. Can we
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> I was able to run the first failing test successfully. Can we get a
retest, please?
@ajacques I just rebased and pushed my branch off of master. Perhaps the
easiest thing to do wo
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Anybody else able to reproduce this failure? It succeeded on my developer
machine.
It worked for me, too. Let's see what a retest d
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
Oh dear. I don't know why we're getting all these sigkills. I think we're
going to need another retest...
---
-
To unsubscribe
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
Hi @gatorsmile. Where do you see us at this point? Do you still want to get
this into Spark 2.4?
---
-
To unsubscribe, e-mail
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> @mallman, if we are all happy here, mind taking a look
https://github.com/apache/spark/pull/21320#issuecomment-408271470 and
https://github.com/apache/spark/pull/21320#issuecomment-406765
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
Sorry it's taken me a couple of days to respond. I needed the time to
ruminate (and not). I could write voluminously, but I just want to reply to a
couple of points and move on.
> I th
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Thanks @jainaks for the sample file and instructions to reproduce the
problem. I will investigate and reply.
@gatorsmile @HyukjinKwon @ajacques I'm seeing incorrect results f
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
Thanks @jainaks for the sample file and instructions to reproduce the
problem. I will investigate and reply
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Few comments like #21320 (comment) or ^ are not minor or nits. I leave
hard -1 if they are not addressed.
I'm sorry to say I'm very close to hanging up this PR. I put a lot of care,
t
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Where does that leave both of these PRs? Do we still want this one with
the code refactoring or to go back to the original? Are there any comments for
this PR that would block merging? I've
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
Success! Now where do we stand, @gatorsmile @HyukjinKwon ?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Are there any other blockers to enabling this by default now that
@mallman fixed the currently known broken queries?
The functionality exercised by the ignored t
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> The tests as committed pass for me, but I removed the order by id and I
got that error. Are you saying it works with the specific query in my comment?
Oh! I didn't notice you chan
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> @mallman: I've rebased on top of your changes and pushed. I'm seeing the
following:
That test passes for me locally. Also, I inspected your branch and could
not find any err
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21889#discussion_r207718734
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
---
@@ -0,0 +1,205
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21889#discussion_r207719862
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
---
@@ -0,0 +1,205
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> @mallman: I've rebased on top of your changes and pushed. I'm seeing the
following
That's the test case that I "unignored". It was passing. There must be some
simple e
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Test build #94228 has finished for PR 21889 at commit 92901da.
The test failure appears to be unrelated to this PR.
Is it just me or has the test suite become flakier in the p
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> The tests as committed pass for me, but I removed the order by id and I
got that error. Are you saying it works with the specific query in my comment?
@ajacques Please try this qu
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
I've pushed a commit to restore the original test coverage while also
ensuring determinism of the output. Don't ask me how I did it. It's a secret!
The test that was failing before
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> select id, name.middle, address from temp - Works
> select name.middle, address from temp - Fails
> select name.middle from temp - Works
> select name.middle, id, addre
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> @mallman Is it related to this revert in ParquetReadSupport.scala? I
re-added this logic and all 32 tests in ParquetSchemaPruningSuite passed.
Yes. That's what we need to w
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> This patch fails Scala style tests.
Hi @ajacques. I'm not sure if you're aware of this, but you can run the
scalastyle checks locally with
```
sbt scalast
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Alright to make sure we're all on the same page, it sounds like we're
ready to merge this PR pending:
>
> * Successful build by Jenkins
> * Any PR comments from
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
>> @mallman, while we wait for the go-no-go, do you have the changes for
the next PR ready? Is there anything you need help with?
> I have the hack I used originally, but I have
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> @mallman, while we wait for the go-no-go, do you have the changes for the
next PR ready? Is there anything you need help with?
I have the hack I used originally, but I haven't tr
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
@ajacques Please rebase off my branch.
@gatorsmile I don't recall seeing that error before. Any idea for how I can
reproduce and debug
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Assuming from #21889 (comment), we shouldn't have any identified bug
here. What kind of bugs left to be fixed?
That bug was address by b50ddb4. We still need to fix the bug underly
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21889#discussion_r208446828
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
---
@@ -0,0 +1,205
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
See https://github.com/apache/spark/pull/21320#issuecomment-406353694 for
@gatorsmile's request to move the changes to `ParquetReadSupport.scala` to
another PR.
There was another
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> just for clarification, so now .. there no outstanding bugs, some tests
are ignored per #21320 (comment) and left comments were mostly addressed. Did i
understand correctly?
The igno
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
@ajacques I added a commit to enable schema pruning by default. It's a
little more complete than your commit to do the same. Please rebase off my
branch and remove your commit
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Then should we keep this one or #21889? shall we deduplicate the efforts?
I requested to open that because this looks going to be inactive per your
comments.
As I stated before, I
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @mallman if you're planning on making more code changes, would you be
willing to work on a shared branch or something? I've been working to
incorporate the CR comments.
No, howe
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
> Due to the urgency of the upcoming 2.4 code freeze, I'm going to open
this PR to collect any feedback. This can be closed if you prefer to continue
to the work in the original
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
>> Hello, we've been using your patch at Stripe and we've found something
that looks like a new bug:
>
> Thank you for sharing this, @xinxin-stripe. This is very helpful. I will
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Hello, we've been using your patch at Stripe and we've found something
that looks like a new bug:
Thank you for sharing this, @xinxin-stripe. This is very helpful. I will
investig
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
>> the window of opportunity to review syntax and style in this PR closed
long ago.
> Why/when is this window closed? Who closed that?
What I wrote above is a coarse approximat
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> I see no point of leaving this PR open.
I don't agree with you on that point, and I've expressed my view in
https://github.com/apache/spark/pull/21889#issuecomment-413655
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
Essentially, this PR was created to take the management of #21320 out of my
hands, with a view towards facilitating its incorporation into Spark 2.4. It
was my suggestion, one based in frustration
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @mallman, can we close this PR? Are you willing to update here or not?
I pushed an update less than a day ago, and I intend to continue pushing
updates as nee
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21889
Are we waiting for @gatorsmile's go-ahead and merge?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Try this when spark.sql.nestedSchemaPruning.enabled is on?
This is a case-sensitivity issue (obviously). I'll get to the root of it.
Tha
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Try this when spark.sql.nestedSchemaPruning.enabled is on?
I don't think this will be difficult to fix. I'm working on it now and will
add relevant test cover
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r212388958
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
---
@@ -202,11 +204,15 @@ private[parquet
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r212396370
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
---
@@ -202,11 +204,15 @@ private[parquet
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @mallman Could you remove the changes made in ParquetRowConverter.scala
and also turn off spark.sql.nestedSchemaPruning.enabled by default in this PR?
D
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
Thanks everyone for your contributions, support and patience. It's been a
journey and a half, and I'm excited for the future. I will open a follow-on PR
to address the current known failure
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
@gatorsmile Any concerns about merging this PR at this point?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Add some test cases when turning on spark.sql.caseSensitive?
Will do.
---
-
To unsubscribe, e-mail: reviews-unsub
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
@gatorsmile How does this look?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> I think if the tests are few, you can make them ignored for now here, and
make another PR enabling it back with the changes in ParquetReadSupport.scala.
That's the approach I've ta
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> BTW, I am trying to take a look closely. I would appreciate if there are
some concrete examples so that I (or other reviewers) can double check along.
Parquet is pretty core fix and let's be v
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r204208518
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala
---
@@ -0,0 +1,387 @@
+/*
+ * Licensed
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @mallman I still think we need to split it to two PRs. To resolve the
issues you mentioned above, how about creating a separate PR? Only 10 days left
before the code freeze of Spark 2.4. We p
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
The test failure is unrelated to this patch. Shall we retest?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r204209033
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1298,8 +1298,18 @@ object SQLConf {
"issues.
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Could we move the changes made in ParquetReadSupport.scala to a separate
PR? Then, we can merge this PR very quickly.
If I remove the changes to `ParquetReadSupport.scala`, then f
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @mallman I still think we need to split it to two PRs. To resolve the
issues you mentioned above, how about creating a separate PR? Only 10 days left
before the code freeze of Spark 2.4. We p
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r204206072
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
---
@@ -0,0 +1,156
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205020970
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
---
@@ -71,9 +80,22 @@ private[parquet
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205021140
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
---
@@ -0,0 +1,156
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205021282
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/ProjectionOverSchema.scala
---
@@ -0,0 +1,62 @@
+/*
+ * Licensed
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205021469
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
---
@@ -0,0 +1,153
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205021712
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala
---
@@ -0,0 +1,388 @@
+/*
+ * Licensed
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205022799
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
---
@@ -0,0 +1,156
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205022895
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
---
@@ -0,0 +1,153
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r205022974
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala
---
@@ -0,0 +1,388 @@
+/*
+ * Licensed
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Regarding #21320 (comment), can you at least set this enable by default
and see if some existing tests are broken or not?
I have no intention to at this po
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> gentle ping @mallman since the code freeze is close
Outside of my primary occupation, my top priority on this PR right now is
investigating
https://github.com/apache/spark/pull/21
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
>> Hi @jainaks. Thanks for your report. Do you have the same problem
running your test with this PR?
> @mallman Yes, the issue with window functions is reproducible even with
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/15673#discussion_r216037341
--- Diff:
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
@@ -586,17 +587,31 @@ private[client] class Shim_v0_13 extends
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
Hi @viirya,
Thanks for this PR! I have an alternative implementation which I'd like to
submit for comparison. My implementation was something I removed from my
original patch.
I
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
> @mallman It will be great that we can have this fix in 2.4 release as
this can dramatically reduce the data being read in many applications which is
the purpose of the original work.
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
> FYI, per further checking code and discussion with @dbtsai regarding with
predicate pushdown, we know that predicate push down only works for primitive
types on Parquet datasource. So b
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/22357#discussion_r216545091
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
---
@@ -110,7 +110,17 @@ private[sql
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
I have reconstructed my original patch for this issue, but I've discovered
it will require more work to complete. However, as part of that reconstruction
I've discovered a couple of cases where our
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/19410
Hi @szhem.
I understand you've put a lot of work into this implementation, however I
think you should try a simpler approach before we consider something more
complicated. I believe
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r201863251
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
---
@@ -71,9 +80,22 @@ private[parquet
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r201863353
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
---
@@ -0,0 +1,153
Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/21320#discussion_r201863463
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala
---
@@ -0,0 +1,134 @@
+/*
+ * Licensed
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/19410
Hi @szhem.
Thanks for the information regarding disk use for your scenario. What do
you think about my second point, using the `ContextCleaner
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
@viirya Please amend
https://github.com/apache/spark/blob/d684a0f30599d50061ef78ec62edcdd3b726e2d9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22394
FYI @viirya @dbtsai @gatorsmile @HyukjinKwon
Can I get someone's review of this PR please? The unmasked failures appear
to be false positives, so no changes to the tested code
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
And FYI this is the Jira issue I promised in
https://github.com/apache/spark/pull/22357#issuecomment-419940228
yesterday: https://issues.apache.org/jira/browse/SPARK-25407
501 - 600 of 656 matches
Mail list logo