[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-782206308 @cloud-fan I have updated the tests. PTAL

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-781590779 > the new fix LGTM. Do you know why the test didn't expose that bug previously? The old test was only comparing row counts, not the actual data. I have updated the test to read the data and compare it with the expected data.
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-773471709 > Do we need to change this part? > > https://github.com/apache/spark/blob/3a361cd837eeea5b5c82b0f90f5d1987a8a30328/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L383-L392 I don't think we do. The dictionary was initialized correctly after we changed the descriptor, so we shouldn't have a problem decoding it. I could be wrong, @revans2 do you agree?
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-771216519 > @razajafri, do you mind clarifying the PR description? For example, I thought you meant writing out to files or somewhere by: > > > Spark should read it as a long but write it as an int by downcasting it and calling the appropriate method to set the integer value > > The change itself looks like it makes sense, but my question is how parquet-mr handles this case. Vectorized readers are supposed to be contributed back to the Parquet side, so it would be great to know how they handle it and match that. It's more that it looks a bit odd to me that we should manipulate the column descriptor. I have updated the description. PTAL
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-771025263 @tgravescs I have updated the test with the comment that you recommended, PTAL. @cloud-fan do you have any other questions?
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-769380453 > Another simpler idea is to fix the schema inference: > > https://github.com/apache/spark/blob/3a361cd837eeea5b5c82b0f90f5d1987a8a30328/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L109-L121 > > For INT64, we should make sure the inferred `DecimalType` is a long decimal. Then we will allocate long column vectors and get rid of this issue. It's also probably more efficient, as there is no down-casting and the space waste is not a big deal. I like the idea, but how will you ensure the `DecimalType` is inferred as a long decimal? When Spark initializes the `WritableColumnVector` it bases the backing type on the precision. Can you please show how we can ensure that while making sure the correct precision is set?
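A minimal sketch of the dispatch being discussed (hypothetical names, not Spark's actual code): today the backing array type for a `DecimalType` column is chosen from the logical precision alone, up to 9 digits fits an int and up to 18 a long, so a column whose physical type in the file is INT64 but whose logical precision is small ends up with an int-backed vector. The suggestion above is to let the file's physical type win:

```java
// Hypothetical illustration only; method and class names are made up.
public class VectorTypeSketch {
    static String vectorTypeFor(int precision, boolean physicalInt64) {
        if (physicalInt64) {
            return "long"; // proposed: match the file's INT64 physical type
        }
        return precision <= 9 ? "int" : "long"; // current: logical precision only
    }

    public static void main(String[] args) {
        // Decimal(7,3) stored as INT64: proposed inference picks a long vector
        System.out.println(vectorTypeFor(7, true));
        // Decimal(7,3) stored as INT32: int vector as before
        System.out.println(vectorTypeFor(7, false));
    }
}
```

The open question in the comment is where the proposed `physicalInt64` signal would come from while keeping the declared precision/scale intact.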
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768845586 > Thank you for making a PR, @razajafri . Could you rebase this PR to the master branch please? I have rebased. PTAL
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768845356 > @razajafri, do you mind clarifying the PR description? For example, I thought you meant writing out to files or somewhere by: > > > Spark should read it as a long but write it as an int by downcasting it and calling the appropriate method to set the integer value > > The change itself looks like it makes sense, but my question is how parquet-mr handles this case. Vectorized readers are supposed to be contributed back to the Parquet side, so it would be great to know how they handle it and match that. It's more that it looks a bit odd to me that we should manipulate the column descriptor. Sorry about the confusion, I have updated the documentation. For your other question, the bug is in how we write the value to the `WritableColumnVector`, not in how we read from the `VectorizedValuesReader`. This is why parquet-mr doesn't have to worry about it.
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768618314 @revans2 @tgravescs is there anything else you need me to look at here? This is my first PR in Spark, so where do we go from here?
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768517258 @revans2 I ran a test manually with two files of 1M records each, written with Spark 3.0.0. They were read with Spark 3.0.0, Spark 3.1, and master with my fix. Each file was read 3 times; I used spark.time to time the reads, which isn't the best method, I know, but it still gives us ballpark numbers.

File 1 contains rows of [Decimal(18,0), Decimal(7,3), Decimal(7,7), Decimal(12,2)]. Average read times:
spark-3.0: 3960 ms
spark-3.1: 4262 ms
spark-master-with-fix: 4129 ms

File 2 contains rows of [Decimal(12,2)]. Average read times:
spark-3.0: 683 ms
spark-3.1: 668 ms
spark-master-with-fix: 638 ms

I don't know if/how we can automate a unit test for this. Let me know what you think.
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768424921 > > ... but write it as an int by downcasting it ... > > @razajafri would you mind pointing out where it happens? Sure, here is the original [code](https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224) which initializes the `WritableColumnVector` based on the batchSchema, so the vector is backed by an Int array. [Here](https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L370) you can see it tries to call `putLong` on that Int-backed vector, which blows up. And [here](https://github.com/apache/spark/pull/31284/files#diff-a01e174e178366aadf07f64ee690d47d343b2ca416a4a2b2ea735887c22d5934R365) is the change I made to downcast the value before writing it as an Int.
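A minimal sketch of the failure mode and the fix described above (hypothetical names, not Spark's actual API): the unscaled decimal value arrives as a 64-bit long from the INT64 column, but the vector was allocated with an int backing array because the logical type is a small-precision decimal. Narrowing the value before the put is safe here because a precision of at most 9 digits guarantees the value fits in 32 bits:

```java
// Hypothetical illustration only; method and class names are made up.
public class DowncastSketch {
    static void putLongAsInt(int[] intBackedVector, int rowId, long value) {
        // Fix: downcast before writing, instead of calling a putLong that
        // the int-backed vector cannot satisfy.
        intBackedVector[rowId] = (int) value;
    }

    public static void main(String[] args) {
        int[] column = new int[1];
        // e.g. the unscaled value of a Decimal(7,3) read back as a long
        putLongAsInt(column, 0, 1234567L);
        System.out.println(column[0]);
    }
}
```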
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768074328 > The code looks fine to me, but do you have any tests to show what the performance impact might be to other parquet files that have decimal values written as they typically are by Spark? Specifically 64-bit decimal values. I missed your comment. I will work on getting you that test.
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-768066093 > I think this can be revised a bit to make it more understandable. I guess another approach is to initialize the `WritableColumnVector` to use a `long` array instead of `int`. It will use more memory, but perhaps performance will be better, and the code could be much simpler. So I thought about doing it that way, but I didn't think it would be beneficial to have the general-case implementation pay the extra memory cost for a fringe case such as this.
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-765743754 > I'm a bit confused by your description; it would be nice to add more detail. Looking at the code, I think what you are saying is that you read it as a long from the parquet file but then downcast it to an int when writing it to the column, correct? If you don't downcast to an INT then the `OnHeapColumnVector` blows up, but why does that blow up? > > perhaps clarify what you mean by "write it" when you say "Spark will read it as a long but write it as an int by downcasting it." @tgravescs I have updated the description on this PR; does it make more sense now?
razajafri commented on pull request #31284: URL: https://github.com/apache/spark/pull/31284#issuecomment-765022515 @tgravescs @revans2 PTAL