megri commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-2069863018
I am experiencing the same issue, using the same setup as paulpaul1076. Thanks to this discussion I also tried changing from S3FileIO to the default FileIO, and so far it seems to be working.
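A minimal sketch of that switch, assuming a Spark catalog named `my_catalog` (the name and settings here are hypothetical, not megri's exact configuration):
```scala
import org.apache.spark.sql.SparkSession

// Leaving out the io-impl property lets Iceberg fall back to its
// Hadoop-based default FileIO instead of S3FileIO.
val spark = SparkSession.builder()
  .appName("iceberg-fileio-switch")
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "hive")
  // The setup that hit the content-length errors pinned S3FileIO explicitly:
  // .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .getOrCreate()
```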
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1970723432
@ajantha-bhat you don't need large scale to reproduce it at all. For me this problem started happening after the first run of rewrite_data_files; the second run started failing.
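A sketch of that reproduction path, assuming the compaction runs through the Spark SQL procedure (catalog and table names are hypothetical):
```scala
// Hypothetical names; per the report, the first run succeeds and the
// second fails with the content-length error. Assumes an active `spark`
// session, as in spark-shell.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")
```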
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1970719190
@RussellSpitzer thanks a lot for helping with this. I want to give a bit more detail (we discussed this with Russell in the Iceberg Slack).
This is how I would load my catalog:
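A representative configuration consistent with the rest of the thread (Hive catalog plus S3FileIO; all names are hypothetical, not the author's exact config) might look like:
```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reconstruction: a Hive catalog backed by S3FileIO,
// matching the setup discussed in this thread.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "hive")
  .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/warehouse")
  .getOrCreate()
```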
ajantha-bhat commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1970555983
Thanks for helping narrow this down, @RussellSpitzer 👍
We still need to figure out the solution to this problem, but I am not sure how to reproduce it locally with small data.
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1969253762
@ajantha-bhat yea, so, I am wondering, maybe there are some settings that could be tuned to make this work in Spark SQL.
The thing is, I ran both the Spark DSL and Spark SQL and compared them.
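For reference, the two invocations being compared would look roughly like this (table name hypothetical); the Scala action path reportedly works while the SQL procedure fails:
```scala
import org.apache.iceberg.spark.Spark3Util
import org.apache.iceberg.spark.actions.SparkActions

// 1) Scala DSL: the RewriteDataFiles action invoked directly.
val table = Spark3Util.loadIcebergTable(spark, "my_catalog.db.events")
SparkActions.get(spark).rewriteDataFiles(table).execute()

// 2) Spark SQL: the stored-procedure wrapper around the same action.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")
```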
ajantha-bhat commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1969247917
I don't think this is related to catalogs.
Catalogs just keep track of the table metadata file. Here the call stack is about Spark reading the Parquet file from storage using
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1969238526
Just to update this. We deployed the Nessie catalog in prod and this issue
persists for some odd reason.
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1949240582
@nastra unfortunately this doesn't seem to be the only reason for the content-length exception. We have now discovered that it still fails, even though I stopped using the direct str
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1944320729
We discussed in Slack that this is due to Iceberg's streaming writer not being unique; this PR should fix it: https://github.com/apache/iceberg/pull/9255. Waiting for Iceberg 1.5.
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1944183346
@nastra So, I seem to have discovered new info about what's going on. For some reason there are two entries for the same file in the Iceberg metadata:
![image](https://github.co
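One hedged way to check for that symptom is to query the table's `files` metadata table for paths that appear more than once (catalog and table names hypothetical):
```scala
// Look for data files that the current snapshot's metadata lists twice.
spark.sql(
  """SELECT file_path, COUNT(*) AS entries
    |FROM my_catalog.db.events.files
    |GROUP BY file_path
    |HAVING COUNT(*) > 1""".stripMargin
).show(truncate = false)
```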
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1943746314
@nastra these are the logs from the driver that does the compaction and fails with this content-length exception, and from one of the executors:
[logs.zip](https://github.com/
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1942611489
@nastra thank you very much, I will try tomorrow and let you know!
nastra commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1941427126
@paulpaul1076 I don't have an Airflow setup but I ran a streaming job
locally and created 4000+ files.
The specific setup I used was from the [Spark quickstart
example](https:
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1937083012
@nastra The easiest way to reproduce it is to just use my streaming job and leave it running, maybe even for a few days. Also schedule the compaction job in Airflow to run ev
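A sketch of that kind of small-file generator, using Spark's built-in rate source and a short trigger interval (all names and paths are hypothetical, not the actual job):
```scala
import org.apache.spark.sql.streaming.Trigger

// Continuously append tiny batches so the (pre-existing) Iceberg table
// accumulates many small data files; compaction is scheduled separately.
spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100")
  .load()
  .writeStream
  .format("iceberg")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
  .toTable("my_catalog.db.events")
```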
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1937082611
Let me know if you manage to do it or not.
nastra commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1936929767
Thanks @paulpaul1076, I will try and reproduce this next week on my end
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1936244579
Btw, as I said, the Scala DSL for compaction works and Spark SQL doesn't.
I compared the job parameters in the Spark UI; they are absolutely identical, so it's not like t
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1936239049
@nastra where should I upload the data for you? I will upload it, and then you can register the table in your catalog. I used the Hive catalog, but I don't think it matters.
Anyw
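For the record, registering an uploaded table in another catalog can be done with the `register_table` procedure, pointed at the table's metadata JSON (all paths and names here are hypothetical):
```scala
// Register an existing Iceberg table in this catalog from its metadata file.
spark.sql(
  """CALL my_catalog.system.register_table(
    |  table => 'db.events',
    |  metadata_file => 's3://my-bucket/warehouse/db/events/metadata/v42.metadata.json')""".stripMargin)
```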
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1933903982
Basically I had a streaming job that was streaming small files. Then I
stopped it, tried compacting, and it failed with these content-length
exceptions. I'll try to find some fr
nastra commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1933750195
> Anyways, got it to work, now there's a similar exception, but written a bit differently:
>
> ```
> org.apache.iceberg.exceptions.RuntimeIOException: java.io.EOFException: Re
> ```
nastra commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1933702701
> Looks like iceberg-aws-bundle doesn't have this class:
>
> `Exception in thread "main" java.lang.NoClassDefFoundError:
software/amazon/awssdk/http/urlconnection/UrlConnectionH
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1932587930
Anyways, got it to work, now there's a similar exception, but written a bit differently:
```
org.apache.iceberg.exceptions.RuntimeIOException: java.io.EOFException: Reac
```
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1932570894
Looks like iceberg-aws-bundle doesn't have this class:
`Exception in thread "main" java.lang.NoClassDefFoundError: software/amazon/awssdk/http/urlconnection/UrlConnectionH`
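The missing class ships in the AWS SDK's separate `url-connection-client` artifact, which (per the comment above) the bundle doesn't seem to include. One way to pull it into an sbt build (version illustrative):
```scala
// Supply the JDK UrlConnection HTTP client that http-client.type=urlconnection
// needs; pick a version matching the rest of your AWS SDK v2 dependencies.
libraryDependencies += "software.amazon.awssdk" % "url-connection-client" % "2.20.0"
```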
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1932550064
I just need to set a Spark option like this, right?
`spark.sql.catalog.my_catalog.http-client.type=urlconnection`
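A sketch of one way to set it when building the session (catalog name hypothetical). It is a plain catalog property, so no Iceberg jars need rebuilding, though the UrlConnection client class must be on the classpath:
```scala
import org.apache.spark.sql.SparkSession

// Switch S3FileIO's HTTP client from the default Apache client to the
// JDK UrlConnection client via a per-catalog property.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog.http-client.type", "urlconnection")
  .getOrCreate()
```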
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1932545715
Yea, I can try with that setting. Where do I set it, by the way? Do I have to rebuild the Iceberg jars?
The problem is not the RewriteDataFiles Spark action, it's the procedure.
nastra commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1932300265
@paulpaul1076 do you have a chance to try with
`http-client.type=urlconnection`? It's of course also possible that there's a
bug in `RewriteDataFilesSparkAction` that went unnoticed.
paulpaul1076 commented on issue #9679:
URL: https://github.com/apache/iceberg/issues/9679#issuecomment-1932287892
@nastra the Scala code works fine, the problem is inside iceberg.