[GitHub] spark pull request: [SPARK-8848] [SQL] [WIP] Refactors Parquet wri...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/7679#issuecomment-126616847 @liancheng I can transfer tables from mysql -> parquet, including unsigned bigint -> DECIMAL(20) (YEAH!). I couldn't find any problems by reading the code (yet).
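For reference, a minimal sketch of the kind of transfer described above, assuming a Spark 1.4+ spark-shell with the MySQL JDBC driver on the classpath; the JDBC URL, credentials, table name and output path are placeholders, not values from the PR.

```scala
// Sketch only: copy a MySQL table into Parquet from the spark-shell.
// URL, credentials, table name and output path are made-up placeholders.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// With the JDBC type fixes, MySQL's "unsigned bigint" columns map to DECIMAL(20,0).
val df = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/test", "accounts", props)
df.printSchema()

// Writing DECIMAL(20,0) to Parquet relies on the precision > 18 support discussed here.
df.write.parquet("/tmp/accounts.parquet")
```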
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/7455#issuecomment-125324116 @liancheng thank you for your time reviewing and implementing this! :bow:
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-122529489 @liancheng sure, I just wasn't sure if it should be closed :-)
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer closed the pull request at: https://github.com/apache/spark/pull/6796
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34875478

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

Sounds like a changelog entry to me. (We can now write Parquet files for decimals with precision > 18, so please check compatibility if you use Spark-written Parquet files elsewhere.)
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34872908

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

I would prefer the way you currently wrote it. I don't see a point in keeping a "store it in a way that an older version can read" flag. You should always try a new version before using it for real storage, and reading files written by an old Spark version will always be possible.

PS: solved the test issue. It looks like the Spark sbt build somehow managed to use a local 2.9.6 scalac 0.o
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34870594

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

Ah, yes, you removed the < 8 check too. But shouldn't followParquetFormatSpec = false generate compatible files?

I'm getting a compiler assertion on test compile, and I tried cleaning :-S Anyway, it must be some sort of local **.
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34868189

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

This should fail when followParquetFormatSpec = false. I was trying to test it with "withSQLConf" but couldn't get it to work: https://github.com/apache/spark/pull/6796/files#diff-82fab6131b7092c5faa4064fd04c3d72R135

(I have to find out why I can't run tests locally, ./build/sbt sql/test fails with a compiler assertion?!?)
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-119301151 The writeDecimal method is rather ugly, and the write path needs to know whether we follow the Parquet format spec or not, as this implies a different encoding (addInteger / addLong).
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-119170768 Hi @liancheng, I'm rebasing on your PR right now. I can work for ~1-2h / day on this PR, so feel free to take over the PR if this blocks anything.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-116769197

Hi @liancheng, thank you for the thorough review, I will push a reworked version soon. Everything sounds reasonable :-)

By "private" settings I meant that I can't change the setting in the shell because it's marked as "isPublic = false" in https://github.com/liancheng/spark/blob/2a2062d3f530ecd26e75b306aee42761d67d8724/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L273 and I'm not sure if that's intended.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33420742

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala ---
@@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
         .length(minBytesForPrecision(precision))
         .named(field.name)

-    case dec @ DecimalType() if !followParquetFormatSpec =>
-      throw new AnalysisException(
-        s"Data type $dec is not supported. " +
-          s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key} is set to false," +
-          "decimal precision and scale must be specified, " +
-          "and precision must be less than or equal to 18.")
-
      // =
      // Decimals (follow Parquet format spec)
      // =

      // Uses INT32 for 4 byte encodings / precision <= 9
      case DecimalType.Fixed(precision, scale)
-       if precision <= maxPrecisionForBytes(4) && followParquetFormatSpec =>
+       if followParquetFormatSpec && maxPrecisionForBytes(3) < precision &&
+         precision <= maxPrecisionForBytes(4) =>
--- End diff --

We had a debate about using the most compact storage type if possible. As such INT32 loses compared to a 3-byte fixed-length array (the byte/precision bound is sketched below).

On 28 June 2015 10:59:15 CEST, Cheng Lian wrote:
> Why do we want `maxPrecisionForBytes(3) < precision` here? Did I miss something?
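To make the bound in that guard concrete, here is a small sketch of the precision-per-byte formula from the Parquet format spec (an n-byte two's-complement value holds floor(log10(2^(8n-1) - 1)) base-10 digits). The method name mirrors the converter's helper, but the body is only an illustration, not the code from the PR.

```scala
// Max decimal precision that fits in numBytes bytes of two's-complement data,
// per the Parquet spec: floor(log10(2^(8 * numBytes - 1) - 1)).
def maxPrecisionForBytes(numBytes: Int): Int =
  math.floor(math.log10(math.pow(2, 8 * numBytes - 1) - 1)).toInt

// maxPrecisionForBytes(3) == 6 and maxPrecisionForBytes(4) == 9, so the guard
// above routes precisions 7..9 to INT32 and leaves precision <= 6 to a
// shorter FIXED_LEN_BYTE_ARRAY.
(1 to 8).foreach(n => println(s"$n byte(s) -> max precision ${maxPrecisionForBytes(n)}"))
```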
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33420719

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala ---
@@ -169,11 +169,12 @@ private[parquet] class CatalystSchemaConverter(
       }

       case INT96 =>
-        CatalystSchemaConverter.analysisRequire(
-          assumeInt96IsTimestamp,
-          "INT96 is not supported unless it's interpreted as timestamp. " +
-            s"Please try to set ${SQLConf.PARQUET_INT96_AS_TIMESTAMP.key} to true.")
-        TimestampType
+        field.getOriginalType match {
+          case DECIMAL => makeDecimalType(maxPrecisionForBytes(12))
+          case _ if assumeInt96IsTimestamp => TimestampType
+          case null => makeDecimalType(maxPrecisionForBytes(12))
+          case _ => illegalType()
+        }
--- End diff --

Didn't know about the deprecation, will drop it.

On 28 June 2015 10:56:00 CEST, Cheng Lian wrote:
> `INT96` is only used for nanosecond timestamp types for historical reasons, and is to be deprecated. Let's not use it for decimals.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33420714

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala ---
@@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
         .length(minBytesForPrecision(precision))
         .named(field.name)

-    case dec @ DecimalType() if !followParquetFormatSpec =>
-      throw new AnalysisException(
-        s"Data type $dec is not supported. " +
-          s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key} is set to false," +
-          "decimal precision and scale must be specified, " +
-          "and precision must be less than or equal to 18.")
-
--- End diff --

You said it should use the Hive default of (10,0), or did I misinterpret that?

On 28 June 2015 10:53:00 CEST, Cheng Lian wrote:
> We still need this branch to handle the case where precision information is missing.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-115620818

@liancheng it starts to work (compiles and a minimal initial test worked, no guarantees). I think there are some points that need feedback:

- How is the compatibility mode intended to work? Settings are currently private, but I'd like to store Decimal(19), so is lifting the 18 limit correct for compatibility mode?
- INT32/INT64 are only used when the byte length matches the byte length for the precision. FIXED_LEN_BYTE_ARRAY will thus e.g. be used to store 6 byte values.
- FIXED_LEN_BYTE_ARRAY means I'll have to create an array of the correct size. I've increased the scratch_bytes. Not very happy about the code path, do you have better ideas?
- BYTES_FOR_PRECISION needs to handle any precision. I've reworked that code (a sketch of the byte-width calculation follows below). Again, suggestions welcome.

The patch is now way smaller and less intrusive. Looks like the refactoring was well worth the effort!
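On the BYTES_FOR_PRECISION point, here is a sketch of what the byte-width calculation has to do for arbitrary precision. It is an illustration of the bound from the Parquet format spec, not the code from the patch.

```scala
// Smallest number of two's-complement bytes that can hold `precision` base-10
// digits, i.e. the FIXED_LEN_BYTE_ARRAY length for a given decimal precision.
def minBytesForPrecision(precision: Int): Int =
  Stream.from(1)
    .find(n => math.floor(math.log10(math.pow(2, 8 * n - 1) - 1)) >= precision)
    .get

// precision 9 -> 4 bytes, 18 -> 8, 19/20 -> 9, 38 -> 16
Seq(9, 18, 19, 20, 38).foreach(p => println(s"precision $p -> ${minBytesForPrecision(p)} bytes"))
```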
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-11533

Currently reworking the patch. Here is the warning about the tuple match:

```
[warn] /home/rtreffer/work/spark-master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala:334: object Fixed expects 2 patterns to hold (Int, Int) but crushing into 2-tuple to fit single pattern (SI-6675)
```

According to the ticket it's a deprecation warning: https://issues.scala-lang.org/browse/SI-6675

Nothing urgent, but I think it should be fixed at some point.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-114835713

@liancheng I'll rebase on your branch, I really like the way you cleaned up toPrimitiveDataType by using a fluent Types interface. This will make this patch way easier.

Talking about testing/compatibility/interoperability, I have added a Hive-generated Parquet file that I'd like to turn into a test case: https://github.com/rtreffer/spark/tree/spark-4176-store-large-decimal-in-parquet/sql/core/src/test/resources/hive-decimal-parquet

There are some Parquet files attached to tickets in JIRA, too. Do you plan to convert those into tests?

Regarding FIXED_LEN_BYTE_ARRAY: the relative overhead decreases with size. The BINARY length prefix would add <10% from about DECIMAL(100) and <25% from about DECIMAL(40) (pre-compression; see the check below), and I'd expect DECIMAL(40) to use the full precision only from time to time. But yeah, I overlooked the 4-byte overhead at https://github.com/Parquet/parquet-format/blob/master/Encodings.md and assumed it would be less, so FIXED_LEN_BYTE_ARRAY should be good for now (until someone complains).
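A quick back-of-the-envelope check of those overhead figures, using a closed-form version of the same spec bound for the minimal unscaled-value width; these are rough numbers, not measurements from actual Parquet files.

```scala
// Worst-case overhead of BINARY's 4-byte length prefix (plain encoding),
// relative to the minimal two's-complement width of the unscaled value.
def minBytes(precision: Int): Int =
  math.ceil((precision / math.log10(2) + 1) / 8).toInt  // closed form of the byte-width bound

Seq(40, 100).foreach { p =>
  val w = minBytes(p)
  println(f"DECIMAL($p%3d): $w%2d bytes + 4-byte length -> ${400.0 / w}%.1f%% overhead")
}
// DECIMAL( 40): 17 bytes -> ~23.5%    DECIMAL(100): 42 bytes -> ~9.5%
```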
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33133591

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends Logging {
     case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
     case TimestampType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
     case DecimalType.Fixed(precision, scale) if precision <= 18 =>
-      // TODO: for now, our writer only supports decimals that fit in a Long
       Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
         Some(ParquetOriginalType.DECIMAL),
         Some(new DecimalMetadata(precision, scale)),
         Some(BYTES_FOR_PRECISION(precision
+    case DecimalType.Fixed(precision, scale) =>
+      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY,
--- End diff --

Under the assumption that all values use the full length, yes. But at some point the overhead of the length prefix is low compared to the overhead when someone only specifies the upper bound of the values. I have to check if it really uses 4 bytes for BINARY; I'd then raise the threshold to ~40 bytes length (meaning <=10% worst-case overhead before compression).

It won't simplify the decoding/writing though, because the <=18 case is used for long decoding.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33129732

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala ---
@@ -331,7 +331,7 @@ private[sql] class JDBCRDD(
       case BooleanType => BooleanConversion
       case DateType => DateConversion
       case DecimalType.Unlimited => DecimalConversion(None)
-      case DecimalType.Fixed(d) => DecimalConversion(Some(d))
+      case DecimalType.Fixed(d, s) => DecimalConversion(Some((d, s)))
--- End diff --

As said, it was only about a warning, not about correctness. I'll drop this change in the next version, it draws too much attention and is not needed.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33124241

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---
@@ -369,9 +371,6 @@ private[parquet] class MutableRowWriteSupport extends RowWriteSupport {
       case DateType => writer.addInteger(record.getInt(index))
       case TimestampType => writeTimestamp(record.getLong(index))
       case d: DecimalType =>
-        if (d.precisionInfo == None || d.precisionInfo.get.precision > 18) {
-          sys.error(s"Unsupported datatype $d, cannot write to consumer")
-        }
--- End diff --

An oversight, a bug. We can't serialize without that info, Parquet requires it (and mixed scale would be complicated).

PS: do you know if there is any interest in allowing mixed Decimal in Parquet?

On 24 June 2015 09:36:11 CEST, Cheng Lian wrote:
> Don't we need to consider the case where `d.precisionInfo == None` now?
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32892827

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -289,7 +295,20 @@ private[parquet] object ParquetTypesConverter extends Logging {
       name: String,
       nullable: Boolean = true,
       inArray: Boolean = false,
+      parquetSchema: Option[ParquetType] = None,
       toThriftSchemaNames: Boolean = false): ParquetType = {
+
+    val parquetElementTypeBySchema =
--- End diff --

It also performs a type check / conversion. That's why I've removed it. It would look like this:

```scala
val parquetElementTypeBySchema = parquetSchema
  .filter(_.isInstanceOf[ParquetGroupType])
  .filter(_.containsField(name))
  .map(_.getType(name))
```

I would settle on collect, does that look ok?

```scala
val parquetElementTypeBySchema = parquetSchema.collect {
  case gType : ParquetGroupType if (gType.containsField(name)) => gType.getType(name)
}
```
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32891515

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends Logging {
     case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
     case TimestampType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
     case DecimalType.Fixed(precision, scale) if precision <= 18 =>
-      // TODO: for now, our writer only supports decimals that fit in a Long
       Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
--- End diff --

What would we gain by encoding it that way? We use a minimal-length fixed byte array, which should provide a similarly compact encoding (DECIMAL(9) should end up as 4 bytes, and smaller decimal values should take even less space; see the encoding sketch below). Decoding is a different story, though.

PS: I was focusing on DECIMAL with precision >= 19. Shouldn't small decimal handling be a new ticket?

On 21 June 2015 02:41:44 CEST, Davies Liu wrote:
> We should use int32, int64 if possible, see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>
> ```
> DECIMAL can be used to annotate the following types:
>
> int32: for 1 <= precision <= 9
> int64: for 1 <= precision <= 18; precision <= 10 will produce a warning
> fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
> binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
> ```
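To illustrate the fixed-length layout being discussed, here is a sketch of how an unscaled value is laid out as an n-byte big-endian two's-complement array, the encoding the Parquet spec mandates for fixed and binary decimals. It is illustrative only; the actual writer reuses a scratch buffer instead of allocating a fresh array per value.

```scala
// Encode an unscaled decimal value into numBytes big-endian two's-complement
// bytes, i.e. the FIXED_LEN_BYTE_ARRAY payload for a DECIMAL column.
def toFixedBytes(unscaled: BigInt, numBytes: Int): Array[Byte] = {
  val minimal = unscaled.toByteArray  // already big-endian two's complement
  require(minimal.length <= numBytes, s"needs ${minimal.length} bytes, got $numBytes")
  val pad = if (unscaled.signum < 0) 0xFF.toByte else 0x00.toByte
  Array.fill[Byte](numBytes - minimal.length)(pad) ++ minimal  // sign-extend on the left
}

// A DECIMAL(9) value fits in 4 bytes:
println(toFixedBytes(BigInt(123456789), 4).map(b => f"$b%02x").mkString(" "))  // 07 5b cd 15
println(toFixedBytes(BigInt(-1), 4).map(b => f"$b%02x").mkString(" "))         // ff ff ff ff
```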
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32891379

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -289,7 +295,20 @@ private[parquet] object ParquetTypesConverter extends Logging {
       name: String,
       nullable: Boolean = true,
       inArray: Boolean = false,
+      parquetSchema: Option[ParquetType] = None,
       toThriftSchemaNames: Boolean = false): ParquetType = {
+
+    val parquetElementTypeBySchema =
--- End diff --

Ah, yes, an early version had that, I somehow moved to this verbose code O.o Thanks.

On 21 June 2015 02:33:15 CEST, Davies Liu wrote:
> These could be
>
> ```
> parquetSchema.filter(_.containsField(name)).map(_.getType(name))
> ```
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113834732 I've pushed a very early version of a fix. (Literally early, it's nearly 1:00 am, and I'd expect the test build to fail; I'll fix the outstanding issues later today.) PS: Loading the Hive Parquet file works now, but I've not tested much more yet.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113814084

The problematic line is https://github.com/apache/spark/blob/v1.4.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L105 (calling ParquetTypesConverter.convertFromAttributes). This causes the schema to be normalized, but that does not work for decimal:

```
message hive_schema {
  optional int32 id;
  optional fixed_len_byte_array(13) value (DECIMAL(30,0));
}
```

is replaced by

```
message root {
  optional int32 id;
  optional binary value (DECIMAL(30,0));
}
```

I removed that line and loading the data works.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113762279

Ok, I think I'm slowly getting down to the cause: strictTypeChecking on https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/ColumnIOFactory.java#L100 creates the exception.

The relevant class for the job setup is ParquetTableScan (doExecute). I'm not yet sure whether this can be fixed on the job setup or on the receiver side.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113629961 I've pushed the Hive-generated Parquet file and I'll call it a day. I think I'll have to relax the validation of column types for DECIMAL.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113628860

Ok, it looks like I can't open Hive-generated Parquet files, but it looks more like a type error.

```
scala> val hive = sqlContext.load("/home/rtreffer/work/hadoop/hive-parquet")
warning: there was one deprecation warning; re-run with -deprecation for details
15/06/19 22:03:26 INFO ParquetFileReader: Initiating action with parallelism: 5
hive: org.apache.spark.sql.DataFrame = [id: int, value: decimal(30,0)]

scala> hive.collect.foreach(println)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_9_piece0 on localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(130208) called with curMem=0, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 127.2 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(14082) called with curMem=130208, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 13.8 KB, free 265.0 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:42189 (size: 13.8 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO SparkContext: Created broadcast 10 from collect at :38
15/06/19 22:03:35 INFO SparkContext: Starting job: collect at :38
15/06/19 22:03:35 INFO DAGScheduler: Got job 9 (collect at :38) with 1 output partitions (allowLocal=false)
15/06/19 22:03:35 INFO DAGScheduler: Final stage: ResultStage 11(collect at :38)
15/06/19 22:03:35 INFO DAGScheduler: Parents of final stage: List()
15/06/19 22:03:35 INFO DAGScheduler: Missing parents: List()
15/06/19 22:03:35 INFO DAGScheduler: Submitting ResultStage 11 (MapPartitionsRDD[45] at collect at :38), which has no missing parents
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(5568) called with curMem=144290, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 5.4 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(2964) called with curMem=149858, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.9 KB, free 265.0 MB)
15/06/19 22:03:36 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:42189 (size: 2.9 KB, free: 265.1 MB)
15/06/19 22:03:36 INFO SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:893
15/06/19 22:03:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 11 (MapPartitionsRDD[45] at collect at :38)
15/06/19 22:03:36 INFO TaskSchedulerImpl: Adding task set 11.0 with 1 tasks
15/06/19 22:03:36 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 36, localhost, PROCESS_LOCAL, 1471 bytes)
15/06/19 22:03:36 INFO Executor: Running task 0.0 in stage 11.0 (TID 36)
15/06/19 22:03:36 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1: Input split: ParquetInputSplit{part: file:/home/rtreffer/work/hadoop/hive-parquet/00_0 start: 0 end: 874 length: 874 hosts: []}
15/06/19 22:03:36 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 33 records.
15/06/19 22:03:36 INFO InternalParquetRecordReader: at row 0. reading next block
15/06/19 22:03:36 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 33
15/06/19 22:03:36 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/rtreffer/work/hadoop/hive-parquet/00_0
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
	at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$class.foreach(Iterator.scala:750)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.sca
```
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113622541 (hive 1.2.0 and hadoop 2.7.0 without hdfs or a cluster)
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113621866

Just did a test with Hive: I can declare a Parquet file written with Spark as

```
CREATE EXTERNAL TABLE ptest (id INT, value DECIMAL(30,0))
STORED AS PARQUET
LOCATION "file:///home/rtreffer/work/hadoop/parquet";
```

and it *does* work. I'm now trying to test the opposite direction, plus a test case.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113277495 Yes, manually. I could add the file I was writing afterwards, sounds like a good idea.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113276153

I was thinking about this: create a small Parquet file with Spark, load it with Hive, copy it to a new Parquet table with Hive, and read that with Spark. If that matches the input -> win. Otherwise -> some more work.

PS: SPARK-8342 / SPARK-8359 were only problems during my initial tests. It's a bit harder to test read/write of Decimal if the implementation has bugs. So those are unrelated to this patch, but they might reduce its usefulness (you can't do reliable math in the ranges you can now load/save).
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113274389 The only reason for the WIP is that I have not yet cross-tested the interoperability with e.g. Hive. It follows the spec, but I'd like to test it (or have someone verify this).
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32371677

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala ---
@@ -331,7 +331,7 @@ private[sql] class JDBCRDD(
       case BooleanType => BooleanConversion
       case DateType => DateConversion
       case DecimalType.Unlimited => DecimalConversion(None)
-      case DecimalType.Fixed(d) => DecimalConversion(Some(d))
+      case DecimalType.Fixed(d, s) => DecimalConversion(Some((d, s)))
--- End diff --

It caused a warning on my system, so I thought it would be better to make it explicit. I could drop that line from this patch, though.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-111713135 Note: I came across https://issues.apache.org/jira/browse/SPARK-8342 while testing, it seems like Decimal math is unsafe at the moment.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/6796

[SPARK-4176][WIP] Support decimal types with precision > 18 in parquet

This is my current WIP on SPARK-4176. It should be compatible with other implementations of Parquet. https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal

```
For byte arrays, binary and fixed, the unscaled number must be encoded as two's complement using big-endian byte order (the most significant byte is the zeroth element)
```

This is the default encoding of BigInteger. It should thus be compatible with other implementations, although it would be great if someone could test this. I've tested this locally with powers of 2 up to 2^200 in the spark shell, without errors.

Code I've used for (local) testing (in the spark shell):

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.io.File

def decimalList(n : Int) = {
  require(n > 0)
  val one = Decimal(1)
  val two = Decimal(2)
  one.changePrecision(n,0)
  two.changePrecision(n,0)
  val list = scala.collection.mutable.ArrayBuffer(one)
  while (list.length < n) {
    val v = list(list.length - 1)
    list += Decimal(v.toJavaBigDecimal.multiply(two.toJavaBigDecimal),n,0)
  }
  list.toList
}

def decimalRows(l : List[Decimal]) = l.zipWithIndex.map(e => Row(e._2,e._1))

def decimalRowRdd(l : List[Decimal]) = sc.parallelize(decimalRows(l))

def df(n : Int) = {
  val data = decimalList(n)
  val len = data.lastOption.get.toString.length
  val schema = StructType(Array(
    StructField("id", IntegerType, true),
    StructField("value", DecimalType(len,0), true)))
  sqlContext.createDataFrame(decimalRowRdd(data), schema)
}

def delete(filename : String) : Unit = {
  val f = new File(filename)
  if (!f.exists) return
  if (f.isDirectory) f.listFiles.foreach(_.delete)
  f.delete
}

def test(n : Int) = {
  val src = df(n)
  delete("/tmp/typetest")
  src.save("/tmp/typetest")
  val copy = sqlContext.load("/tmp/typetest")
  src.collect().sortBy(_.getInt(0)).zip(copy.collect().sortBy(_.getInt(0))).foreach(e => {
    if (e._1 != e._2) println(s"${e._1} != ${e._2}")
  })
  delete("/tmp/typetest")
  println(s"Tested 1..2^${n - 1}")
}
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rtreffer/spark spark-4176-store-large-decimal-in-parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6796.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6796

commit 32630dfa440f618021d5d0a1db86cb7a3ea5e2f4
Author: Rene Treffer
Date: 2015-06-13T12:54:42Z

    [SPARK-4176] Support decimal types with precision > 18
[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6789#issuecomment-111690002

Oh, I'm really lucky with Jenkins, it seems like it hit a network error (or a GitHub quota or similar): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34820/console

Rebased on master to trigger another build.
[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6789#issuecomment-111594673 @viirya pushed the original patch, any objections?
[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/6789

[SPARK-7897] Improve type for jdbc/"unsigned bigint"

The original fix uses DecimalType.Unlimited, which is harder to handle afterwards. There is no scale and most data should fit into a long, thus DecimalType(20,0) should be better.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rtreffer/spark spark-7897-unsigned-bigint-as-decimal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6789.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6789

commit e99052c5b8866c1f0ec98309c8bc249139552f5a
Author: Rene Treffer
Date: 2015-06-12T07:39:42Z

    Fix type for "unsigned bigint" jdbc loading.

    The original fix uses DecimalType.Unlimited, which is harder to handle afterwards. There is no scale and most data should fit into a long, thus DecimalType(20,0) should be better.
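As a side note on where the precision of 20 comes from (just the arithmetic, not part of the patch): MySQL's unsigned bigint ranges up to 2^64 - 1, which needs 20 decimal digits and no scale.

```scala
// 2^64 - 1 is the largest unsigned bigint value; it has 20 decimal digits,
// so DecimalType(20, 0) can hold every possible value exactly.
val maxUnsignedBigint = BigInt(2).pow(64) - 1
println(maxUnsignedBigint)                  // 18446744073709551615
println(maxUnsignedBigint.toString.length)  // 20
```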
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102951240 @rxin thank you, found the MimaExcludes, let's see if I've done that correctly.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102866270

Jenkins fails with

```
[info] spark-sql: found 5 potential binary incompatibilities (filtered 287)
[error]  * object org.apache.spark.sql.jdbc.DriverQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks$")
[error]  * class org.apache.spark.sql.jdbc.DriverQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks")
[error]  * class org.apache.spark.sql.jdbc.PostgresQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.PostgresQuirks")
[error]  * class org.apache.spark.sql.jdbc.NoQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.NoQuirks")
[error]  * class org.apache.spark.sql.jdbc.MySQLQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.MySQLQuirks")
```

but that was the intention of the rename.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102865899 Rebased on master, squashed, added a case class for the JdbcType and got rid of null.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102583414 It's on my todo list for this weekend. On 16 May 2015 at 08:15:38 CEST, Reynold Xin wrote: >@rtreffer do you have time to update this? > >--- >Reply to this email directly or view it on GitHub: >https://github.com/apache/spark/pull/#issuecomment-102576337 -- This message was sent from my Android phone with K-9 Mail.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/#discussion_r29850737 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala --- @@ -0,0 +1,205 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.jdbc + +import org.apache.spark.sql.types._ +import org.apache.spark.annotation.DeveloperApi + +import java.sql.Types + + +/** + * :: DeveloperApi :: + * Encapsulates everything (extensions, workarounds, quirks) to handle the + * SQL dialect of a certain database or jdbc driver. + * Lots of databases define types that aren't explicitly supported + * by the JDBC spec. Some JDBC drivers also report inaccurate + * information---for instance, BIT(n>1) being reported as a BIT type is quite + * common, even though BIT in JDBC is meant for single-bit values. Also, there + * does not appear to be a standard name for an unbounded string or binary + * type; we use BLOB and CLOB by default but override with database-specific + * alternatives when these are absent or do not behave correctly. + * + * Currently, the only thing done by the dialect is type mapping. + * `getCatalystType` is used when reading from a JDBC table and `getJDBCType` + * is used when writing to a JDBC table. If `getCatalystType` returns `null`, + * the default type handling is used for the given JDBC type. Similarly, + * if `getJDBCType` returns `(null, None)`, the default type handling is used + * for the given Catalyst type. + */ +@DeveloperApi +abstract class JdbcDialect { + /** + * Check if this dialect instance can handle a certain jdbc url. + * @param url the jdbc url. + * @return True if the dialect can be applied on the given jdbc url. + * @throws NullPointerException if the url is null. + */ + def canHandle(url : String): Boolean + + /** + * Get the custom datatype mapping for the given jdbc meta information. + * @param sqlType The sql type (see java.sql.Types) + * @param typeName The sql type name (e.g. "BIGINT UNSIGNED") + * @param size The size of the type. + * @param md Result metadata associated with this type. + * @return The actual DataType (subclasses of [[org.apache.spark.sql.types.DataType]]) + * or null if the default type mapping should be used. + */ + def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType = +null + + /** + * Retrieve the jdbc / sql type for a give datatype. + * @param dt The datatype (e.g. [[org.apache.spark.sql.types.StringType]]) + * @return A tuple of sql type name and sql type, or {{{(null, None)}}} for no change. + */ + def getJDBCType(dt: DataType): (String, Option[Int]) = (null, None) --- End diff -- Sounds like a good idea. 
It would also clarify the magic getJDBCType(...)._1 and getJDBCType(...)._2 in jdbc.scala.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-99427524 All the feedback regarding this has been addressed (hopefully) and tests are green. What should I do next?
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-97702611 Still no build :cry:
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-96573899 @marmbrus what should we do now? New PR?
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-96204329 @marmbrus I've fixed the line breaks, but Jenkins does not seem to pick up the changes and rerun the tests :-( What should I do?
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-96058998 Thank you for the fast feedback; I'm quite impressed by the feedback speed and quality on the Spark project. I was traveling last week, so the fix was a bit delayed.
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer closed the pull request at: https://github.com/apache/spark/pull/5498
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93963311 Replaced by https://github.com/apache/spark/pull/
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/ [SPARK-6888][SQL] Make the jdbc driver handling user-definable Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect) and allow developers to change the dialects on the fly (for new JDBCRDDs only). Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java. The status quo is that the RDD will fail to load. This patch makes it possible to override the type mapping and, for example, read 64-bit numbers as strings and handle them afterwards in user code. JDBCSuite has an example that maps all types to String, which should always work (at the cost of extra code afterwards). As a side effect it should now be possible to develop simple dialects out-of-tree and even in spark-shell. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rtreffer/spark jdbc-dialects Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes # commit 3e98f746a90a4764647fb9f4b17a416cab3f868a Author: Rene Treffer Date: 2015-04-17T09:06:55Z [SPARK-6888] Make the jdbc driver handling user-definable Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect) and allow developers to change the dialects on the fly (for new JDBCRDDs only). Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java. The status quo is that the RDD will fail to load. This patch makes it possible to override the type mapping and, for example, read 64-bit numbers as strings and handle them afterwards in user code. JDBCSuite has an example that maps all types to String, which should always work (at the cost of extra code afterwards). As a side effect it should now be possible to develop simple dialects out-of-tree and even in spark-shell.
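The JDBCSuite test itself is not quoted in this mail; a hypothetical spark-shell dialect in the same spirit might look like the sketch below. The object name, the url check, and the register/unregister calls follow the JdbcDialects API that came out of this work rather than the exact code in the PR, so treat it as an illustration:
```
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Map every reported JDBC type to StringType so any table loads,
// at the cost of converting the columns later in user code.
object AllStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    Some(StringType)
}

JdbcDialects.registerDialect(AllStringDialect)
// ... create the JDBC DataFrame here; every column arrives as a string ...
JdbcDialects.unregisterDialect(AllStringDialect)
```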
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93677337 @marmbrus thank you, I'll fix those issues and open a new PR when done. Regarding naming/API: it is quite common to have one class per SQL/JDBC dialect, often named exactly that way (e.g. MySQLDialect in Hibernate); I've found quite a few projects that use the same naming (via GitHub search). In this case I'd like to match that naming and add a default implementation per method (returning the neutral element). On the other hand, it currently only does type mapping, so JDBCTypeMapping would be a valid name too; it would restrict the use case more (which can be good or bad). I guess you know better what would suit Spark :-)
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93571593 I still have to write tests for AggregatedQuirks.
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93272238 @liancheng thank you, I will update the patch. Just one question: should I squash/amend the fixes or should I add a second commit?
[GitHub] spark pull request: Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-92521352 Added a ticket: https://issues.apache.org/jira/browse/SPARK-6888 I will add it to the commit after some sleep .zZzZzZ
[GitHub] spark pull request: Export driver quirks
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/5498 Export driver quirks Make it possible to (temporarily) override the driver quirks. This can be used to work around problems with specific schemas or to add support for new JDBC drivers on the fly. A very simple implementation that dumps the type information seen during loading can be written like this (in spark-shell):
```
class DumpQuirk extends org.apache.spark.sql.jdbc.DriverQuirks {
  def canHandle(url: String): Boolean = true

  def getCatalystType(
      sqlType: Int,
      typeName: String,
      size: Int,
      md: org.apache.spark.sql.types.MetadataBuilder): org.apache.spark.sql.types.DataType = {
    // Print every type the JDBC driver reports, then fall back to the default mapping.
    println("" + (sqlType, typeName, size, md))
    null
  }

  def getJDBCType(dt: org.apache.spark.sql.types.DataType): (String, Option[Int]) = (null, None)
}
org.apache.spark.sql.jdbc.DriverQuirks.registerQuirks(new DumpQuirk())
```
Note that this pull request is against 1.3 - I could not create a distribution from the current master. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rtreffer/spark export-driver-quirks Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5498.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5498 commit dca9372fc32e3222afc25e11394b1e620c115efa Author: Rene Treffer Date: 2015-04-13T21:38:59Z Export driver quirks Make it possible to (temporarily) override the driver quirks. This can be used to work around problems with specific schemas or to add support for new JDBC drivers on the fly.