[GitHub] spark pull request: [SPARK-8848] [SQL] [WIP] Refactors Parquet wri...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/7679#issuecomment-126616847 @liancheng I can transfer tables from mysql -> parquet, including unsigned bigint -> DECIMAL(20) (YEAH!). I couldn't find any problems by reading the code (yet).
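For reference, a minimal sketch of the kind of transfer described above, assuming a Spark 1.4+ spark-shell with the MySQL JDBC driver on the classpath; the JDBC URL, credentials, table name and output path are placeholders, not values from the PR.

```scala
// Sketch only: copy a MySQL table into Parquet from the spark-shell.
// URL, credentials, table name and output path are made-up placeholders.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// With the JDBC type fixes, MySQL's "unsigned bigint" columns map to DECIMAL(20,0).
val df = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/test", "accounts", props)
df.printSchema()

// Writing DECIMAL(20,0) to Parquet relies on the precision > 18 support discussed here.
df.write.parquet("/tmp/accounts.parquet")
```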
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/7455#issuecomment-125324116 @liancheng thank you for your time reviewing and implementing this! :bow:
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-122529489 @liancheng sure, I just wasn't sure if it should be closed :-)
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer closed the pull request at: https://github.com/apache/spark/pull/6796
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34875478

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

Sounds like a changelog entry to me. (We can now write Parquet files for decimals with precision > 18, so please check compatibility if you use Spark-written Parquet files elsewhere.)
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34872908

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

I would prefer the way you currently wrote it. I don't see a point in keeping a "store it in a way that an older version can read" flag. You should always try a new version before using it for real storage, and reading files written by an old Spark version will always be possible.

PS: solved the test issue. It looks like the Spark sbt build somehow managed to use a local 2.9.6 scalac 0.o
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34870594

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

Ah, yes, you removed the < 8 check too. But shouldn't followParquetFormatSpec = false generate compatible files?

I'm getting a compiler assertion on test compile, and I tried cleaning :-S Anyway, it must be some sort of local **.
[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/7455#discussion_r34868189

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
         // Parquet doesn't allow column names with spaces, have to add an alias here
         .select($"_1" cast decimal as "dec")

-    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+    for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

This should fail when followParquetFormatSpec = false. I was trying to test it with "withSQLConf" but couldn't get it to work: https://github.com/apache/spark/pull/6796/files#diff-82fab6131b7092c5faa4064fd04c3d72R135

(I have to find out why I can't run tests locally, ./build/sbt sql/test fails with a compiler assertion?!?)
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-119301151 The writeDecimal method is rather ugly, and the write path needs to know whether we follow the Parquet format spec or not, as this implies a different encoding (addInteger / addLong).
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-119170768 Hi @liancheng, I'm rebasing on your PR right now. I can work for ~1-2h / day on this PR, so feel free to take over the PR if this blocks anything.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-116769197

Hi @liancheng, thank you for the thorough review, I will push a reworked version soon. Everything sounds reasonable :-)

By "private" settings I meant that I can't change the setting in the shell because it's marked as "isPublic = false" in https://github.com/liancheng/spark/blob/2a2062d3f530ecd26e75b306aee42761d67d8724/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L273 and I'm not sure if that's intended.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33420742

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala ---
@@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
         .length(minBytesForPrecision(precision))
         .named(field.name)

-    case dec @ DecimalType() if !followParquetFormatSpec =>
-      throw new AnalysisException(
-        s"Data type $dec is not supported. " +
-          s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key} is set to false," +
-          "decimal precision and scale must be specified, " +
-          "and precision must be less than or equal to 18.")
-
      // =
      // Decimals (follow Parquet format spec)
      // =

      // Uses INT32 for 4 byte encodings / precision <= 9
      case DecimalType.Fixed(precision, scale)
-       if precision <= maxPrecisionForBytes(4) && followParquetFormatSpec =>
+       if followParquetFormatSpec && maxPrecisionForBytes(3) < precision &&
+         precision <= maxPrecisionForBytes(4) =>
--- End diff --

We had a debate about using the most compact storage type if possible. As such INT32 loses compared to a 3-byte fixed-length array (the byte/precision bound is sketched below).

On 28 June 2015 10:59:15 CEST, Cheng Lian wrote:
> Why do we want `maxPrecisionForBytes(3) < precision` here? Did I miss something?
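To make the bound in that guard concrete, here is a small sketch of the precision-per-byte formula from the Parquet format spec (an n-byte two's-complement value holds floor(log10(2^(8n-1) - 1)) base-10 digits). The method name mirrors the converter's helper, but the body is only an illustration, not the code from the PR.

```scala
// Max decimal precision that fits in numBytes bytes of two's-complement data,
// per the Parquet spec: floor(log10(2^(8 * numBytes - 1) - 1)).
def maxPrecisionForBytes(numBytes: Int): Int =
  math.floor(math.log10(math.pow(2, 8 * numBytes - 1) - 1)).toInt

// maxPrecisionForBytes(3) == 6 and maxPrecisionForBytes(4) == 9, so the guard
// above routes precisions 7..9 to INT32 and leaves precision <= 6 to a
// shorter FIXED_LEN_BYTE_ARRAY.
(1 to 8).foreach(n => println(s"$n byte(s) -> max precision ${maxPrecisionForBytes(n)}"))
```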
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33420719

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala ---
@@ -169,11 +169,12 @@ private[parquet] class CatalystSchemaConverter(
       }

       case INT96 =>
-        CatalystSchemaConverter.analysisRequire(
-          assumeInt96IsTimestamp,
-          "INT96 is not supported unless it's interpreted as timestamp. " +
-            s"Please try to set ${SQLConf.PARQUET_INT96_AS_TIMESTAMP.key} to true.")
-        TimestampType
+        field.getOriginalType match {
+          case DECIMAL => makeDecimalType(maxPrecisionForBytes(12))
+          case _ if assumeInt96IsTimestamp => TimestampType
+          case null => makeDecimalType(maxPrecisionForBytes(12))
+          case _ => illegalType()
+        }
--- End diff --

Didn't know about the deprecation, will drop it.

On 28 June 2015 10:56:00 CEST, Cheng Lian wrote:
> `INT96` is only used for nanosecond timestamp types for historical reasons, and is to be deprecated. Let's not use it for decimals.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33420714

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala ---
@@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
         .length(minBytesForPrecision(precision))
         .named(field.name)

-    case dec @ DecimalType() if !followParquetFormatSpec =>
-      throw new AnalysisException(
-        s"Data type $dec is not supported. " +
-          s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key} is set to false," +
-          "decimal precision and scale must be specified, " +
-          "and precision must be less than or equal to 18.")
-
--- End diff --

You said it should use the Hive default of (10,0), or did I misinterpret that?

On 28 June 2015 10:53:00 CEST, Cheng Lian wrote:
> We still need this branch to handle the case where precision information is missing.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-115620818

@liancheng it starts to work (compiles and a minimal initial test worked, no guarantees). I think there are some points that need feedback:

- How is the compatibility mode intended to work? Settings are currently private, but I'd like to store Decimal(19), so is lifting the 18 limit correct for compatibility mode?
- INT32/INT64 are only used when the byte length matches the byte length for the precision. FIXED_LEN_BYTE_ARRAY will thus e.g. be used to store 6 byte values.
- FIXED_LEN_BYTE_ARRAY means I'll have to create an array of the correct size. I've increased the scratch_bytes. Not very happy about the code path, do you have better ideas?
- BYTES_FOR_PRECISION needs to handle any precision. I've reworked that code (a sketch of the byte-width calculation follows below). Again, suggestions welcome.

The patch is now way smaller and less intrusive. Looks like the refactoring was well worth the effort!
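On the BYTES_FOR_PRECISION point, here is a sketch of what the byte-width calculation has to do for arbitrary precision. It is an illustration of the bound from the Parquet format spec, not the code from the patch.

```scala
// Smallest number of two's-complement bytes that can hold `precision` base-10
// digits, i.e. the FIXED_LEN_BYTE_ARRAY length for a given decimal precision.
def minBytesForPrecision(precision: Int): Int =
  Stream.from(1)
    .find(n => math.floor(math.log10(math.pow(2, 8 * n - 1) - 1)) >= precision)
    .get

// precision 9 -> 4 bytes, 18 -> 8, 19/20 -> 9, 38 -> 16
Seq(9, 18, 19, 20, 38).foreach(p => println(s"precision $p -> ${minBytesForPrecision(p)} bytes"))
```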
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-11533

Currently reworking the patch. Here is the warning about the tuple match:

```
[warn] /home/rtreffer/work/spark-master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala:334: object Fixed expects 2 patterns to hold (Int, Int) but crushing into 2-tuple to fit single pattern (SI-6675)
```

According to the ticket it's a deprecation warning: https://issues.scala-lang.org/browse/SI-6675

Nothing urgent, but I think it should be fixed at some point.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-114835713

@liancheng I'll rebase on your branch, I really like the way you cleaned up toPrimitiveDataType by using a fluent Types interface. This will make this patch way easier.

Talking about testing/compatibility/interoperability, I have added a Hive-generated Parquet file that I'd like to turn into a test case: https://github.com/rtreffer/spark/tree/spark-4176-store-large-decimal-in-parquet/sql/core/src/test/resources/hive-decimal-parquet

There are some Parquet files attached to tickets in JIRA, too. Do you plan to convert those into tests?

Regarding FIXED_LEN_BYTE_ARRAY: the relative overhead decreases with size. The BINARY length prefix would add <10% from about DECIMAL(100) and <25% from about DECIMAL(40) (pre-compression; see the check below), and I'd expect DECIMAL(40) to use the full precision only from time to time. But yeah, I overlooked the 4-byte overhead at https://github.com/Parquet/parquet-format/blob/master/Encodings.md and assumed it would be less, so FIXED_LEN_BYTE_ARRAY should be good for now (until someone complains).
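A quick back-of-the-envelope check of those overhead figures, using a closed-form version of the same spec bound for the minimal unscaled-value width; these are rough numbers, not measurements from actual Parquet files.

```scala
// Worst-case overhead of BINARY's 4-byte length prefix (plain encoding),
// relative to the minimal two's-complement width of the unscaled value.
def minBytes(precision: Int): Int =
  math.ceil((precision / math.log10(2) + 1) / 8).toInt  // closed form of the byte-width bound

Seq(40, 100).foreach { p =>
  val w = minBytes(p)
  println(f"DECIMAL($p%3d): $w%2d bytes + 4-byte length -> ${400.0 / w}%.1f%% overhead")
}
// DECIMAL( 40): 17 bytes -> ~23.5%    DECIMAL(100): 42 bytes -> ~9.5%
```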
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33133591

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends Logging {
     case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
     case TimestampType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
     case DecimalType.Fixed(precision, scale) if precision <= 18 =>
-      // TODO: for now, our writer only supports decimals that fit in a Long
       Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
         Some(ParquetOriginalType.DECIMAL),
         Some(new DecimalMetadata(precision, scale)),
         Some(BYTES_FOR_PRECISION(precision
+    case DecimalType.Fixed(precision, scale) =>
+      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY,
--- End diff --

Under the assumption that all values use the full length, yes. But at some point the overhead of the length prefix is low compared to the overhead when someone only specifies the upper bound of the values. I have to check if it really uses 4 bytes for BINARY; I'd then raise the threshold to ~40 bytes length (meaning <=10% worst-case overhead before compression).

It won't simplify the decoding/writing though, because the <=18 case is used for long decoding.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33129732

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala ---
@@ -331,7 +331,7 @@ private[sql] class JDBCRDD(
       case BooleanType => BooleanConversion
       case DateType => DateConversion
       case DecimalType.Unlimited => DecimalConversion(None)
-      case DecimalType.Fixed(d) => DecimalConversion(Some(d))
+      case DecimalType.Fixed(d, s) => DecimalConversion(Some((d, s)))
--- End diff --

As said, it was only about a warning, not about correctness. I'll drop this change in the next version, it draws too much attention and is not needed.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r33124241

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---
@@ -369,9 +371,6 @@ private[parquet] class MutableRowWriteSupport extends RowWriteSupport {
       case DateType => writer.addInteger(record.getInt(index))
       case TimestampType => writeTimestamp(record.getLong(index))
       case d: DecimalType =>
-        if (d.precisionInfo == None || d.precisionInfo.get.precision > 18) {
-          sys.error(s"Unsupported datatype $d, cannot write to consumer")
-        }
--- End diff --

An oversight, a bug. We can't serialize without that info, Parquet requires it (and mixed scale would be complicated).

PS: do you know if there is any interest in allowing mixed Decimal in Parquet?

On 24 June 2015 09:36:11 CEST, Cheng Lian wrote:
> Don't we need to consider the case where `d.precisionInfo == None` now?
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32892827

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -289,7 +295,20 @@ private[parquet] object ParquetTypesConverter extends Logging {
       name: String,
       nullable: Boolean = true,
       inArray: Boolean = false,
+      parquetSchema: Option[ParquetType] = None,
       toThriftSchemaNames: Boolean = false): ParquetType = {
+
+    val parquetElementTypeBySchema =
--- End diff --

It also performs a type check / conversion. That's why I've removed it. It would look like this:

```scala
val parquetElementTypeBySchema = parquetSchema
  .filter(_.isInstanceOf[ParquetGroupType])
  .filter(_.containsField(name))
  .map(_.getType(name))
```

I would settle on collect, does that look ok?

```scala
val parquetElementTypeBySchema = parquetSchema.collect {
  case gType : ParquetGroupType if (gType.containsField(name)) => gType.getType(name)
}
```
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32891515

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends Logging {
     case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
     case TimestampType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
     case DecimalType.Fixed(precision, scale) if precision <= 18 =>
-      // TODO: for now, our writer only supports decimals that fit in a Long
       Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
--- End diff --

What would we gain by encoding it that way? We use a minimal-length fixed byte array, which should provide a similarly compact encoding (DECIMAL(9) should end up as 4 bytes, and smaller decimal values should take even less space; see the encoding sketch below). Decoding is a different story, though.

PS: I was focusing on DECIMAL with precision >= 19. Shouldn't small decimal handling be a new ticket?

On 21 June 2015 02:41:44 CEST, Davies Liu wrote:
> We should use int32, int64 if possible, see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>
> ```
> DECIMAL can be used to annotate the following types:
>
> int32: for 1 <= precision <= 9
> int64: for 1 <= precision <= 18; precision <= 10 will produce a warning
> fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
> binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
> ```
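To illustrate the fixed-length layout being discussed, here is a sketch of how an unscaled value is laid out as an n-byte big-endian two's-complement array, the encoding the Parquet spec mandates for fixed and binary decimals. It is illustrative only; the actual writer reuses a scratch buffer instead of allocating a fresh array per value.

```scala
// Encode an unscaled decimal value into numBytes big-endian two's-complement
// bytes, i.e. the FIXED_LEN_BYTE_ARRAY payload for a DECIMAL column.
def toFixedBytes(unscaled: BigInt, numBytes: Int): Array[Byte] = {
  val minimal = unscaled.toByteArray  // already big-endian two's complement
  require(minimal.length <= numBytes, s"needs ${minimal.length} bytes, got $numBytes")
  val pad = if (unscaled.signum < 0) 0xFF.toByte else 0x00.toByte
  Array.fill[Byte](numBytes - minimal.length)(pad) ++ minimal  // sign-extend on the left
}

// A DECIMAL(9) value fits in 4 bytes:
println(toFixedBytes(BigInt(123456789), 4).map(b => f"$b%02x").mkString(" "))  // 07 5b cd 15
println(toFixedBytes(BigInt(-1), 4).map(b => f"$b%02x").mkString(" "))         // ff ff ff ff
```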
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32891379

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -289,7 +295,20 @@ private[parquet] object ParquetTypesConverter extends Logging {
       name: String,
       nullable: Boolean = true,
       inArray: Boolean = false,
+      parquetSchema: Option[ParquetType] = None,
       toThriftSchemaNames: Boolean = false): ParquetType = {
+
+    val parquetElementTypeBySchema =
--- End diff --

Ah, yes, an early version had that, I somehow moved to this verbose code O.o Thanks.

On 21 June 2015 02:33:15 CEST, Davies Liu wrote:
> These could be
>
> ```
> parquetSchema.filter(_.containsField(name)).map(_.getType(name))
> ```
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113834732 I've pushed a very early version of a fix. (Literally early, it's nearly 1:00 am, and I'd expect the test build to fail; I'll fix the outstanding issues later today.) PS: Loading the Hive Parquet file works now, but I've not tested much more yet.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113814084

The problematic line is https://github.com/apache/spark/blob/v1.4.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L105 (calling ParquetTypesConverter.convertFromAttributes). This causes the schema to be normalized, but that does not work for decimal:

```
message hive_schema {
  optional int32 id;
  optional fixed_len_byte_array(13) value (DECIMAL(30,0));
}
```

is replaced by

```
message root {
  optional int32 id;
  optional binary value (DECIMAL(30,0));
}
```

I removed that line and loading the data works.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113762279

Ok, I think I'm slowly getting down to the cause: strictTypeChecking on https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/ColumnIOFactory.java#L100 creates the exception.

The relevant class for the job setup is ParquetTableScan (doExecute). I'm not yet sure whether this can be fixed on the job setup or on the receiver side.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113629961 I've pushed the Hive-generated Parquet file and I'll call it a day. I think I'll have to relax the validation of column types for DECIMAL.
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113628860

Ok, it looks like I can't open Hive-generated Parquet files, but it looks more like a type error.

```
scala> val hive = sqlContext.load("/home/rtreffer/work/hadoop/hive-parquet")
warning: there was one deprecation warning; re-run with -deprecation for details
15/06/19 22:03:26 INFO ParquetFileReader: Initiating action with parallelism: 5
hive: org.apache.spark.sql.DataFrame = [id: int, value: decimal(30,0)]

scala> hive.collect.foreach(println)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_9_piece0 on localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(130208) called with curMem=0, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 127.2 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(14082) called with curMem=130208, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 13.8 KB, free 265.0 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:42189 (size: 13.8 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO SparkContext: Created broadcast 10 from collect at :38
15/06/19 22:03:35 INFO SparkContext: Starting job: collect at :38
15/06/19 22:03:35 INFO DAGScheduler: Got job 9 (collect at :38) with 1 output partitions (allowLocal=false)
15/06/19 22:03:35 INFO DAGScheduler: Final stage: ResultStage 11(collect at :38)
15/06/19 22:03:35 INFO DAGScheduler: Parents of final stage: List()
15/06/19 22:03:35 INFO DAGScheduler: Missing parents: List()
15/06/19 22:03:35 INFO DAGScheduler: Submitting ResultStage 11 (MapPartitionsRDD[45] at collect at :38), which has no missing parents
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(5568) called with curMem=144290, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 5.4 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(2964) called with curMem=149858, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.9 KB, free 265.0 MB)
15/06/19 22:03:36 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:42189 (size: 2.9 KB, free: 265.1 MB)
15/06/19 22:03:36 INFO SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:893
15/06/19 22:03:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 11 (MapPartitionsRDD[45] at collect at :38)
15/06/19 22:03:36 INFO TaskSchedulerImpl: Adding task set 11.0 with 1 tasks
15/06/19 22:03:36 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 36, localhost, PROCESS_LOCAL, 1471 bytes)
15/06/19 22:03:36 INFO Executor: Running task 0.0 in stage 11.0 (TID 36)
15/06/19 22:03:36 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1: Input split: ParquetInputSplit{part: file:/home/rtreffer/work/hadoop/hive-parquet/00_0 start: 0 end: 874 length: 874 hosts: []}
15/06/19 22:03:36 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 33 records.
15/06/19 22:03:36 INFO InternalParquetRecordReader: at row 0. reading next block
15/06/19 22:03:36 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 33
15/06/19 22:03:36 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/rtreffer/work/hadoop/hive-parquet/00_0
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
	at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$class.foreach(Iterator.scala:750)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.sca
```
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113622541 (hive 1.2.0 and hadoop 2.7.0 without hdfs or a cluster)
[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113621866

Just did a test with Hive: I can declare a Parquet file written with Spark as

```
CREATE EXTERNAL TABLE ptest (id INT, value DECIMAL(30,0))
STORED AS PARQUET
LOCATION "file:///home/rtreffer/work/hadoop/parquet";
```

and it *does* work. I'm now trying to test the opposite direction, plus a test case.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113277495 Yes, manually. I could add the file I was writing afterwards, sounds like a good idea.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113276153

I was thinking about this: create a small Parquet file with Spark, load it with Hive, copy it to a new Parquet table with Hive, and read that with Spark. If that matches the input -> win. Otherwise -> some more work.

PS: SPARK-8342 / SPARK-8359 were only problems during my initial tests. It's a bit harder to test read/write of Decimal if the implementation has bugs. So those are unrelated to this patch, but they might reduce its usefulness (you can't do reliable math in the ranges you can now load/save).
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-113274389 The only reason for the WIP is that I have not yet cross-tested the interoperability with e.g. Hive. It follows the spec, but I'd like to test it (or have someone verify this).
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/6796#discussion_r32371677

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala ---
@@ -331,7 +331,7 @@ private[sql] class JDBCRDD(
       case BooleanType => BooleanConversion
       case DateType => DateConversion
       case DecimalType.Unlimited => DecimalConversion(None)
-      case DecimalType.Fixed(d) => DecimalConversion(Some(d))
+      case DecimalType.Fixed(d, s) => DecimalConversion(Some((d, s)))
--- End diff --

It caused a warning on my system, so I thought it would be better to make it explicit. I could drop that line from this patch, though.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-111713135 Note: I came across https://issues.apache.org/jira/browse/SPARK-8342 while testing, it seems like Decimal math is unsafe at the moment.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/6796

[SPARK-4176][WIP] Support decimal types with precision > 18 in parquet

This is my current WIP on SPARK-4176. It should be compatible with other implementations of Parquet. https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal

```
For byte arrays, binary and fixed, the unscaled number must be encoded as two's complement using big-endian byte order (the most significant byte is the zeroth element)
```

This is the default encoding of BigInteger. It should thus be compatible with other implementations, although it would be great if someone could test this. I've tested this locally with powers of 2 up to 2^200 in the spark shell, without errors.

Code I've used for (local) testing (in the spark shell):

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.io.File

def decimalList(n : Int) = {
  require(n > 0)
  val one = Decimal(1)
  val two = Decimal(2)
  one.changePrecision(n,0)
  two.changePrecision(n,0)
  val list = scala.collection.mutable.ArrayBuffer(one)
  while (list.length < n) {
    val v = list(list.length - 1)
    list += Decimal(v.toJavaBigDecimal.multiply(two.toJavaBigDecimal),n,0)
  }
  list.toList
}

def decimalRows(l : List[Decimal]) = l.zipWithIndex.map(e => Row(e._2,e._1))

def decimalRowRdd(l : List[Decimal]) = sc.parallelize(decimalRows(l))

def df(n : Int) = {
  val data = decimalList(n)
  val len = data.lastOption.get.toString.length
  val schema = StructType(Array(
    StructField("id", IntegerType, true),
    StructField("value", DecimalType(len,0), true)))
  sqlContext.createDataFrame(decimalRowRdd(data), schema)
}

def delete(filename : String) : Unit = {
  val f = new File(filename)
  if (!f.exists) return
  if (f.isDirectory) f.listFiles.foreach(_.delete)
  f.delete
}

def test(n : Int) = {
  val src = df(n)
  delete("/tmp/typetest")
  src.save("/tmp/typetest")
  val copy = sqlContext.load("/tmp/typetest")
  src.collect().sortBy(_.getInt(0)).zip(copy.collect().sortBy(_.getInt(0))).foreach(e => {
    if (e._1 != e._2) println(s"${e._1} != ${e._2}")
  })
  delete("/tmp/typetest")
  println(s"Tested 1..2^${n - 1}")
}
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rtreffer/spark spark-4176-store-large-decimal-in-parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6796.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6796

commit 32630dfa440f618021d5d0a1db86cb7a3ea5e2f4
Author: Rene Treffer
Date: 2015-06-13T12:54:42Z

    [SPARK-4176] Support decimal types with precision > 18
[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6789#issuecomment-111690002

Oh, I'm really lucky with Jenkins, it seems like it hit a network error (or a GitHub quota or similar): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34820/console

Rebased on master to trigger another build.
[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/6789#issuecomment-111594673 @viirya pushed the original patch, any objections?
[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/6789

[SPARK-7897] Improve type for jdbc/"unsigned bigint"

The original fix uses DecimalType.Unlimited, which is harder to handle afterwards. There is no scale and most data should fit into a long, thus DecimalType(20,0) should be better.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rtreffer/spark spark-7897-unsigned-bigint-as-decimal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6789.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6789

commit e99052c5b8866c1f0ec98309c8bc249139552f5a
Author: Rene Treffer
Date: 2015-06-12T07:39:42Z

    Fix type for "unsigned bigint" jdbc loading.

    The original fix uses DecimalType.Unlimited, which is harder to handle afterwards. There is no scale and most data should fit into a long, thus DecimalType(20,0) should be better.
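As a side note on where the precision of 20 comes from (just the arithmetic, not part of the patch): MySQL's unsigned bigint ranges up to 2^64 - 1, which needs 20 decimal digits and no scale.

```scala
// 2^64 - 1 is the largest unsigned bigint value; it has 20 decimal digits,
// so DecimalType(20, 0) can hold every possible value exactly.
val maxUnsignedBigint = BigInt(2).pow(64) - 1
println(maxUnsignedBigint)                  // 18446744073709551615
println(maxUnsignedBigint.toString.length)  // 20
```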
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102951240 @rxin thank you, found the MimaExcludes, let's see if I've done that correctly.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102866270

Jenkins fails with

```
[info] spark-sql: found 5 potential binary incompatibilities (filtered 287)
[error]  * object org.apache.spark.sql.jdbc.DriverQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks$")
[error]  * class org.apache.spark.sql.jdbc.DriverQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks")
[error]  * class org.apache.spark.sql.jdbc.PostgresQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.PostgresQuirks")
[error]  * class org.apache.spark.sql.jdbc.NoQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.NoQuirks")
[error]  * class org.apache.spark.sql.jdbc.MySQLQuirks does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.MySQLQuirks")
```

but that was the intention of the rename.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102865899 Rebased on master, squashed, added a case class for the JdbcType and got rid of null.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-102583414 It's on my todo list for this weekend. On 16 May 2015 at 08:15:38 CEST, Reynold Xin wrote: >@rtreffer do you have time to update this? > >--- >Reply to this email directly or view it on GitHub: >https://github.com/apache/spark/pull/#issuecomment-102576337 -- This message was sent from my Android phone with K-9 Mail.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on a diff in the pull request: https://github.com/apache/spark/pull/#discussion_r29850737 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala --- @@ -0,0 +1,205 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.jdbc + +import org.apache.spark.sql.types._ +import org.apache.spark.annotation.DeveloperApi + +import java.sql.Types + + +/** + * :: DeveloperApi :: + * Encapsulates everything (extensions, workarounds, quirks) to handle the + * SQL dialect of a certain database or jdbc driver. + * Lots of databases define types that aren't explicitly supported + * by the JDBC spec. Some JDBC drivers also report inaccurate + * information---for instance, BIT(n>1) being reported as a BIT type is quite + * common, even though BIT in JDBC is meant for single-bit values. Also, there + * does not appear to be a standard name for an unbounded string or binary + * type; we use BLOB and CLOB by default but override with database-specific + * alternatives when these are absent or do not behave correctly. + * + * Currently, the only thing done by the dialect is type mapping. + * `getCatalystType` is used when reading from a JDBC table and `getJDBCType` + * is used when writing to a JDBC table. If `getCatalystType` returns `null`, + * the default type handling is used for the given JDBC type. Similarly, + * if `getJDBCType` returns `(null, None)`, the default type handling is used + * for the given Catalyst type. + */ +@DeveloperApi +abstract class JdbcDialect { + /** + * Check if this dialect instance can handle a certain jdbc url. + * @param url the jdbc url. + * @return True if the dialect can be applied on the given jdbc url. + * @throws NullPointerException if the url is null. + */ + def canHandle(url : String): Boolean + + /** + * Get the custom datatype mapping for the given jdbc meta information. + * @param sqlType The sql type (see java.sql.Types) + * @param typeName The sql type name (e.g. "BIGINT UNSIGNED") + * @param size The size of the type. + * @param md Result metadata associated with this type. + * @return The actual DataType (subclasses of [[org.apache.spark.sql.types.DataType]]) + * or null if the default type mapping should be used. + */ + def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType = +null + + /** + * Retrieve the jdbc / sql type for a give datatype. + * @param dt The datatype (e.g. [[org.apache.spark.sql.types.StringType]]) + * @return A tuple of sql type name and sql type, or {{{(null, None)}}} for no change. + */ + def getJDBCType(dt: DataType): (String, Option[Int]) = (null, None) --- End diff -- Sounds like a good idea. 
It would also clarify the magic getJDBCType(...)._1 and getJDBCType(...)._2 in jdbc.scala.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-99427524 All the feedback regarding this has been addressed (hopefully) and tests are green. What should I do next?
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-97702611 Still no build :cry:
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-96573899 @marmbrus what should we do now? New PR?
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-96204329 @marmbrus I've fixed the line breaks, but Jenkins does not seem to pick up the changes and rerun the tests :-( What should I do?
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-96058998 Thank you for the fast feedback; I'm quite impressed by the feedback speed and quality on the Spark project. I was traveling last week, so the fix was a bit delayed.
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer closed the pull request at: https://github.com/apache/spark/pull/5498
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93963311 Replaced by https://github.com/apache/spark/pull/
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/ [SPARK-6888][SQL] Make the jdbc driver handling user-definable Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect) and allow developers to change the dialects on the fly (for new JDBCRDDs only). Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java. The status quo is that the RDD will fail to load. This patch makes it possible to override the type mapping and, for example, read 64-bit numbers as strings and handle them afterwards in user code. JDBCSuite has an example that maps all types to String, which should always work (at the cost of extra code afterwards). As a side effect it should now be possible to develop simple dialects out-of-tree and even in spark-shell. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rtreffer/spark jdbc-dialects Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes # commit 3e98f746a90a4764647fb9f4b17a416cab3f868a Author: Rene Treffer Date: 2015-04-17T09:06:55Z [SPARK-6888] Make the jdbc driver handling user-definable Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect) and allow developers to change the dialects on the fly (for new JDBCRDDs only). Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java. The status quo is that the RDD will fail to load. This patch makes it possible to override the type mapping and, for example, read 64-bit numbers as strings and handle them afterwards in user code. JDBCSuite has an example that maps all types to String, which should always work (at the cost of extra code afterwards). As a side effect it should now be possible to develop simple dialects out-of-tree and even in spark-shell.
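The JDBCSuite test itself is not quoted in this mail; a hypothetical spark-shell dialect in the same spirit might look like the sketch below. The object name, the url check, and the register/unregister calls follow the JdbcDialects API that came out of this work rather than the exact code in the PR, so treat it as an illustration:
```
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Map every reported JDBC type to StringType so any table loads,
// at the cost of converting the columns later in user code.
object AllStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    Some(StringType)
}

JdbcDialects.registerDialect(AllStringDialect)
// ... create the JDBC DataFrame here; every column arrives as a string ...
JdbcDialects.unregisterDialect(AllStringDialect)
```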
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93677337 @marmbrus thank you, I'll fix those issues and open a new PR when done. Regarding naming/API: it is quite common to have one class per SQL/JDBC dialect, often named exactly that way (e.g. MySQLDialect in Hibernate); I've found quite a few projects that use the same naming (via GitHub search). In this case I'd like to match that naming and add a default implementation per method (returning the neutral element). On the other hand, it currently only does type mapping, so JDBCTypeMapping would be a valid name too; it would restrict the use case more (which can be good or bad). I guess you know better what would suit Spark :-)
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93571593 I still have to write tests for AggregatedQuirks.
[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-93272238 @liancheng thank you, I will update the patch. Just one question: should I squash/amend the fixes or should I add a second commit?
[GitHub] spark pull request: Export driver quirks
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/5498#issuecomment-92521352 Added a ticket: https://issues.apache.org/jira/browse/SPARK-6888 I will add it to the commit after some sleep .zZzZzZ
[GitHub] spark pull request: Export driver quirks
GitHub user rtreffer opened a pull request: https://github.com/apache/spark/pull/5498 Export driver quirks Make it possible to (temporarily) override the driver quirks. This can be used to work around problems with specific schemas or to add support for new JDBC drivers on the fly. A very simple implementation that dumps the type information seen during loading can be written like this (in spark-shell):
```
class DumpQuirk extends org.apache.spark.sql.jdbc.DriverQuirks {
  def canHandle(url: String): Boolean = true

  def getCatalystType(
      sqlType: Int,
      typeName: String,
      size: Int,
      md: org.apache.spark.sql.types.MetadataBuilder): org.apache.spark.sql.types.DataType = {
    // Print every type the JDBC driver reports, then fall back to the default mapping.
    println("" + (sqlType, typeName, size, md))
    null
  }

  def getJDBCType(dt: org.apache.spark.sql.types.DataType): (String, Option[Int]) = (null, None)
}
org.apache.spark.sql.jdbc.DriverQuirks.registerQuirks(new DumpQuirk())
```
Note that this pull request is against 1.3 - I could not create a distribution from the current master. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rtreffer/spark export-driver-quirks Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5498.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5498 commit dca9372fc32e3222afc25e11394b1e620c115efa Author: Rene Treffer Date: 2015-04-13T21:38:59Z Export driver quirks Make it possible to (temporarily) override the driver quirks. This can be used to work around problems with specific schemas or to add support for new JDBC drivers on the fly.