[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-02-19 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-782206308


   @cloud-fan I have updated the tests PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-02-18 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-781590779


   > the new fix LGTM. Do you know why the test didn't expose that bug 
previously?
   
   The old test was just comparing the row counts and not the actual data. I 
have updated the test now to read the data and compare it with the expected 
data.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-02-05 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-773471709


   > Do we need to change this part?
   > 
   > 
https://github.com/apache/spark/blob/3a361cd837eeea5b5c82b0f90f5d1987a8a30328/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L383-L392
   
   I don't think we do. The reason is that the dictionary was initialized 
correctly after we changed the descriptor so we shouldn't have a problem 
decoding it. I could be wrong, @revans2 do you agree? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-02-04 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-773471709


   > Do we need to change this part?
   > 
   > 
https://github.com/apache/spark/blob/3a361cd837eeea5b5c82b0f90f5d1987a8a30328/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L383-L392
   
   I don't think we do. The reason is that the dictionary was initialized 
correctly after we changed the descriptor so we shouldn't have a problem 
decoding it. I could be wrong, @revans2 do you agree? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-02-01 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-771216519


   > @razajafri, do you mind clarifying PR description? For exmaple, I thought 
you meant writing out to files or somewhere by:
   > 
   > > Spark should read it as a long but write it as an int by downcasting it 
and calling the appropriate method to set the integer value
   > 
   > The change itself looks making sense but my question is how parquet-mr 
handles this case. Vectorized readers are supposed to contribute back to 
Parquet side so it would be great to know how they handle and we match it. It's 
more because it looks a bit odd to me that we should manipulate the column 
descriptor.
   
   I have updated the description. PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-02-01 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-771025263


   @tgravescs I have updated the test with comment that you recommeded, PTAL
   
   @cloud-fan do you have any other questions? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-28 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-769380453


   > Another simpler idea is to fix the schema inference:
   > 
   > 
https://github.com/apache/spark/blob/3a361cd837eeea5b5c82b0f90f5d1987a8a30328/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L109-L121
   > 
   > For INT64, we should make sure the inferred `DecimalType` is a long 
decimal. Then we will allocate long column vectors and get rid of this issue. 
It's also probably more efficient, as there is no down-casting and the space 
waste is not a big deal.
   
   I like the idea but how will you ensure the DecimalType is inferred as 
LongDecimal? When Spark initializes the `WritableColumnVector` it's basing it 
on precision. Can you please show how we can ensure that while making sure the 
correct precision is set? 
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-27 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768845586


   > Thank you for making a PR, @razajafri . Could you rebase this PR to the 
master branch please?
   
   I have rebased. PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-27 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768845356


   > @razajafri, do you mind clarifying PR description? For exmaple, I thought 
you meant writing out to files or somewhere by:
   > 
   > > Spark should read it as a long but write it as an int by downcasting it 
and calling the appropriate method to set the integer value
   > 
   > The change itself looks making sense but my question is how parquet-mr 
handles this case. Vectorized readers are supposed to contribute back to 
Parquet side so it would be great to know how they handle and we match it. It's 
more because it looks a bit odd to me that we should manipulate the column 
descriptor.
   
   Sorry about the confusion, I have updated the documentation. 
   
   For your other question, the bug is in how we are writing value to 
`WritableColumnVector` not while reading from `VectorizedValueReader`. This is 
why Parquet-mr doesn't have to worry about this. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-27 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768618314


   @revans2 @tgravescs is there anything else you guys need me to look at in 
this? This is my first PR in Spark, where do we go from here? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-27 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768517258


   @revans2 I ran a test manually with two files with 1M records written with 
Spark 3.0.0. They were read in with Spark-3.0.0, Spark-3.1 and with master with 
my fix. Each file was read in 3 times, I used spark.time to time the read which 
isn't the best way I know but still gives us a ball park number
   
   File 1 contains rows of [Decimal(18,0), Decimal(7,3), Decimal(7,7), 
Decimal(12,2)]. Read times avg ms are 
   spark-3.0: 3960 ms 
   spark-3.1: 4262 ms
   spark-master-with-fix: 4129 ms
   
   File 2 contains rows of [Decimal(12,2)]
   spark-3.0: 683 ms 
   spark-3.1: 668 ms
   spark-master-with-fix: 638 ms
   
   I don't know if/how we can automate a unit test for this. Let me know what 
you think



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-27 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768424921


   > > ... but write it as an int by downcasting it  ...
   > 
   > @razajafri would you mind pointing out where it happens?
   
   Sure, here is the original 
[code](https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224)
 which initializes the `WritableColumnVector` which is based on the batchSchema 
which will initialize it to an Int. 
   
   And 
[here](https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java#L370)
 you can see it tries to call `putLong` to a vector backed by an Int vector 
which will blow up. 
   
   and 
[here](https://github.com/apache/spark/pull/31284/files#diff-a01e174e178366aadf07f64ee690d47d343b2ca416a4a2b2ea735887c22d5934R365)
 is the change that I made to downcast it before writing it as an Int 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-26 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768074328


   > The code looks fine to me but do you have any tests to show what the 
performance impact might be to other parquet files that have decimal values 
written as they typically are by Spark? Specifically 64-bit decimal values.
   
   I missed your comment. I will work on getting you that test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-26 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-768066093


   > I think this can be revised a bit to make it more understandable. I guess 
another approach is to initialize the `WriteColumnVector` to use `long` array 
instead of `int`. It will use more memory but perhaps performance will be 
better, and the code could be much simpler.
   
   So, I thought about doing it that way but I didn't think it will be 
beneficial to have the general-case implementation suffer from using more 
memory for a fringe case such as this. 
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-22 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-765743754


   > I'm a bit confused by your description, it would be nice to add more 
detail. looking at the code I think what you are saying is that you read it as 
a long from the parquet file but then downcast it to an int when writing it to 
the column, correct? If you don't downcast to an INT then the 
ONHeapColumnVector blows up - but why does that blow up?
   > 
   > perhaps clarify what you mean by write it when you say " Spark will read 
it as a long but write it as an int by downcasting it."
   
   @tgravescs I have updated the description on this PR does it make more sense 
now? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-21 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-765022515


   @tgravescs @revans2 PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] razajafri commented on pull request #31284: [SPARK-34167][SQL]Reading parquet with IntDecimal written as a LongDecimal blows up

2021-01-21 Thread GitBox


razajafri commented on pull request #31284:
URL: https://github.com/apache/spark/pull/31284#issuecomment-765022515


   @tgravescs @revans2 PTAL



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org