[GitHub] spark pull request: [SPARK-8848] [SQL] [WIP] Refactors Parquet wri...

2015-07-31 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/7679#issuecomment-126616847
  
@liancheng I can transfer tables from mysql -> parquet, including unsigned 
bigint -> DECIMAL(20) (YEAH!).
I couldn't find any problems by reading the code (yet).
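(For context, a minimal sketch of the kind of transfer I'm doing, using the Spark 1.4-era DataFrame reader/writer from the shell; the JDBC URL and table name below are placeholders:)

```scala
// Copy a MySQL table to Parquet via Spark SQL (spark-shell, sqlContext in scope).
// Connection details and the table name are placeholders.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost/test?user=spark",
  "dbtable" -> "measurements"          // contains an unsigned BIGINT column
)).load()

// The unsigned BIGINT column comes in as DECIMAL(20,0) and, with this change,
// can be written out to Parquet directly.
df.write.parquet("/tmp/measurements.parquet")
```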





[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...

2015-07-27 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/7455#issuecomment-125324116
  
@liancheng thank you for your time reviewing and implementing this! :bow: 





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-07-18 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-122529489
  
@liancheng sure, I just wasn't sure if it should be closed :-)





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-07-18 Thread rtreffer
Github user rtreffer closed the pull request at:

https://github.com/apache/spark/pull/6796





[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...

2015-07-17 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/7455#discussion_r34875478
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
 // Parquet doesn't allow column names with spaces, have to add an alias here
 .select($"_1" cast decimal as "dec")
 
-for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

Sounds like a changelog entry to me.
(We can now write parquet files for Decimal >18, so please check 
compatibility if you use spark parquet files elsewhere)





[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...

2015-07-17 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/7455#discussion_r34872908
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
 // Parquet doesn't allow column names with spaces, have to add an alias here
 .select($"_1" cast decimal as "dec")
 
-for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

I would prefer the way you currently wrote it.
I don't see a point in keeping a "store it in a way that an older version 
can read" flag. You should always try out a new version before using it for real 
storage, and reading files written by old Spark versions will always be possible.

PS: solved the test thing. It looks like spark sbt somehow managed to use a 
local 2.9.6 scalac 0.o





[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...

2015-07-17 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/7455#discussion_r34870594
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
 // Parquet doesn't allow column names with spaces, have to add an alias here
 .select($"_1" cast decimal as "dec")
 
-for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

Ah, yes, you removed the < 8 check too. But shouldn't 
followParquetFormatSpec = false generate compatible files?

I'm getting a compiler assertion on test compile and I tried cleaning :-S 
Anyway, must be some sort of local **.





[GitHub] spark pull request: [SPARK-4176] [SQL] Supports decimal types with...

2015-07-17 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/7455#discussion_r34868189
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala ---
@@ -107,7 +107,7 @@ class ParquetIOSuiteBase extends QueryTest with ParquetTest {
 // Parquet doesn't allow column names with spaces, have to add an alias here
 .select($"_1" cast decimal as "dec")
 
-for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17))) {
+for ((precision, scale) <- Seq((5, 2), (1, 0), (1, 1), (18, 10), (18, 17), (19, 0), (38, 37))) {
--- End diff --

This should fail if followParquetFormatSpec = false.

I was trying to test "withSQLConf" but couldn't get it to work: 
https://github.com/apache/spark/pull/6796/files#diff-82fab6131b7092c5faa4064fd04c3d72R135

(I have to find out why I can't run tests locally, ./build/sbt sql/test 
fails with a compiler assertion?!?)





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-07-07 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-119301151
  
The writeDecimal method is rather ugly, and the write path needs to know if 
we follow parquet style or not as this implies a different encoding (addInteger 
/ addLong).
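To illustrate the shape of the problem (a sketch, not the actual implementation; `writer` stands for the RecordConsumer of the write support, and `pad` / `minBytesForPrecision` are hypothetical helpers here):

```scala
// Sketch: the write path has to pick the RecordConsumer call based on the
// precision and on whether we follow the Parquet format spec.
private def writeDecimal(decimal: Decimal, precision: Int): Unit = {
  if (followParquetFormatSpec && precision <= 9) {
    writer.addInteger(decimal.toUnscaledLong.toInt)   // INT32-backed DECIMAL
  } else if (followParquetFormatSpec && precision <= 18) {
    writer.addLong(decimal.toUnscaledLong)            // INT64-backed DECIMAL
  } else {
    // FIXED_LEN_BYTE_ARRAY: big-endian two's complement, padded to the
    // minimal length for this precision.
    val unscaled = decimal.toJavaBigDecimal.unscaledValue()
    writer.addBinary(Binary.fromByteArray(pad(unscaled.toByteArray, minBytesForPrecision(precision))))
  }
}
```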





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-07-07 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-119170768
  
Hi @liancheng,

I'm rebasing on your PR right now. I can work for ~1-2h / day on this PR, so 
feel free to take over the PR if this blocks anything.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-29 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-116769197
  
Hi @liancheng, thank you for the thorough review, will push a reworked 
version soon. Everything sounds reasonable :-)

With "private" Settings I meant that I can't change the setting in the 
shell because it's marked as "isPublic = false" in 
https://github.com/liancheng/spark/blob/2a2062d3f530ecd26e75b306aee42761d67d8724/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L273

I'm not sure if that's intended.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-28 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r33420742
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala
 ---
@@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
   .length(minBytesForPrecision(precision))
   .named(field.name)
 
-  case dec @ DecimalType() if !followParquetFormatSpec =>
-throw new AnalysisException(
-  s"Data type $dec is not supported. " +
-s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key} is 
set to false," +
-"decimal precision and scale must be specified, " +
-"and precision must be less than or equal to 18.")
-
   // =
   // Decimals (follow Parquet format spec)
   // =
 
-  // Uses INT32 for 1 <= precision <= 9
+  // Uses INT32 for 4 byte encodings / precision <= 9
   case DecimalType.Fixed(precision, scale)
-if precision <= maxPrecisionForBytes(4) && followParquetFormatSpec 
=>
+if followParquetFormatSpec && maxPrecisionForBytes(3) < precision 
&& 
+  precision <= maxPrecisionForBytes(4) =>
--- End diff --

We had a debate about using the most compact storage type where possible.

As such, INT32 loses compared to a 3-byte fixed-length array.
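To put numbers on that (a quick sketch, using the floor(log10(2^(8n-1) - 1)) formula from the Parquet spec that maxPrecisionForBytes implements):

```scala
// How many base-10 digits fit into n bytes of two's complement storage.
def maxPrecisionForBytes(numBytes: Int): Int =
  math.floor(math.log10(math.pow(2, 8 * numBytes - 1) - 1)).toInt

maxPrecisionForBytes(3)   // 6 -> precisions 1..6 fit a 3-byte fixed array
maxPrecisionForBytes(4)   // 9 -> INT32 covers precisions up to 9

// So the guard "maxPrecisionForBytes(3) < precision" only routes precisions
// 7..9 to INT32; anything smaller is stored more compactly as a shorter
// fixed-length byte array.
```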

On 28 June 2015, 10:59 CEST, Cheng Lian wrote:
>>case DecimalType.Fixed(precision, scale)
>> -if precision <= maxPrecisionForBytes(4) &&
>followParquetFormatSpec =>
>> +if followParquetFormatSpec && maxPrecisionForBytes(3) <
>precision && 
>> +  precision <= maxPrecisionForBytes(4) =>
>
>Why do we want `maxPrecisionForBytes(3) < precision` here? Did I miss
>something?
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/6796/files#r33420647




[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-28 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r33420719
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala
 ---
@@ -169,11 +169,12 @@ private[parquet] class CatalystSchemaConverter(
 }
 
   case INT96 =>
-CatalystSchemaConverter.analysisRequire(
-  assumeInt96IsTimestamp,
-  "INT96 is not supported unless it's interpreted as timestamp. " +
-s"Please try to set ${SQLConf.PARQUET_INT96_AS_TIMESTAMP.key} 
to true.")
-TimestampType
+field.getOriginalType match {
+  case DECIMAL => makeDecimalType(maxPrecisionForBytes(12))
+  case _ if assumeInt96IsTimestamp => TimestampType
+  case null => makeDecimalType(maxPrecisionForBytes(12))
+  case _ => illegalType()
+}
--- End diff --

Didn't know about the deprecation, will drop it.

On 28 June 2015, 10:56 CEST, Cheng Lian wrote:
>> @@ -169,11 +169,12 @@ private[parquet] class CatalystSchemaConverter(
>>  }
>>  
>>case INT96 =>
>> -CatalystSchemaConverter.analysisRequire(
>> -  assumeInt96IsTimestamp,
>> -  "INT96 is not supported unless it's interpreted as
>timestamp. " +
>> -s"Please try to set
>${SQLConf.PARQUET_INT96_AS_TIMESTAMP.key} to true.")
>> -TimestampType
>> +field.getOriginalType match {
>> +  case DECIMAL => makeDecimalType(maxPrecisionForBytes(12))
>> +  case _ if assumeInt96IsTimestamp => TimestampType
>> +  case null => makeDecimalType(maxPrecisionForBytes(12))
>> +  case _ => illegalType()
>> +}
>
>`INT96` is only used for nanosecond timestamp types for historical
>reasons, and is to be deprecated. Let's not use it for decimals.
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/6796/files#r33420609




[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-28 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r33420714
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala
 ---
@@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
   .length(minBytesForPrecision(precision))
   .named(field.name)
 
-  case dec @ DecimalType() if !followParquetFormatSpec =>
-throw new AnalysisException(
-  s"Data type $dec is not supported. " +
-s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key} is 
set to false," +
-"decimal precision and scale must be specified, " +
-"and precision must be less than or equal to 18.")
-
--- End diff --

You said it should use the Hive default of (10,0) - or did I misinterpret 
that?

On 28 June 2015, 10:53 CEST, Cheng Lian wrote:
>> @@ -383,20 +386,14 @@ private[parquet] class CatalystSchemaConverter(
>>.length(minBytesForPrecision(precision))
>>.named(field.name)
>>  
>> -  case dec @ DecimalType() if !followParquetFormatSpec =>
>> -throw new AnalysisException(
>> -  s"Data type $dec is not supported. " +
>> -s"When ${SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC.key}
>is set to false," +
>> -"decimal precision and scale must be specified, " +
>> -"and precision must be less than or equal to 18.")
>> -
>
>We still need this branch to handle the case where precision
>information is missing.
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/6796/files#r33420570




[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-26 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-115620818
  
@liancheng it starts to work (compiles and minimal initial test worked, no 
guarantees). I think there are some points that need feedback
- How is the compatibility mode intended to work? Settings are currently 
private, but I'd like to store Decimal(19), so is lifting the 18 limit correct 
for compatibility mode?
- INT32/INT64 are only used when the byte length matches the byte length 
for the precision. FIXED_LEN_BYTE_ARRAY will thus e.g. be used to store 6 byte 
values
- FIXED_LEN_BYTE_ARRAY means I'll have to create an array of the correct 
size. I've increased the scratch_bytes. Not very happy about the code path, do 
you have better ideas?
- BYTES_FOR_PRECISION needs to handle any precision. I've reworked that 
code. Again, suggestions welcome

The patch is now way smaller and less intrusive. Looks like the refactoring 
was well worth the effort!
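For the BYTES_FOR_PRECISION point above, the idea behind my rework is roughly this (a sketch; the real code differs in detail):

```scala
// Smallest number of bytes whose two's complement range covers any unscaled
// value with the given number of base-10 digits.
def minBytesForPrecision(precision: Int): Int = {
  var numBytes = 1
  while (math.floor(math.log10(math.pow(2, 8 * numBytes - 1) - 1)) < precision) {
    numBytes += 1
  }
  numBytes
}

minBytesForPrecision(9)    // 4  (still fits INT32)
minBytesForPrecision(18)   // 8  (still fits INT64)
minBytesForPrecision(20)   // 9  (unsigned bigint range)
minBytesForPrecision(38)   // 16
```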





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-25 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-11533
  
Currently reworking the patch.

Here is the warning about the tuple match
```
[warn] 
/home/rtreffer/work/spark-master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala:334:
 object Fixed expects 2 patterns to hold (Int, Int) but crushing into 2-tuple 
to fit single pattern (SI-6675)
```
According to the ticket it's a deprecation warning.
https://issues.scala-lang.org/browse/SI-6675

Nothing urgent, but I think it should be fixed at some point.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-24 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-114835713
  
@liancheng I'll rebase on your branch, I really like the way you cleaned up 
toPrimitiveDataType by using a fluent Types interface. This will make this 
patch way easier.

Talking about testing/compatibility/interoperability, I have added a 
hive-generated parquet file that I'd like to turn into a test case:

https://github.com/rtreffer/spark/tree/spark-4176-store-large-decimal-in-parquet/sql/core/src/test/resources/hive-decimal-parquet
 
There are some parquet files attached to tickets in jira, too.
Do you plan to convert those into tests?

Regarding FIXED_LEN_BYTE_ARRAY: the relative overhead decreases with size. BINARY 
overhead would be <10% from ~DECIMAL(100) and <25% from ~DECIMAL(40) 
(pre-compression). I'd expect DECIMAL(40) to use the full precision only from 
time to time. But yeah, I had overlooked the 4-byte length overhead at 
https://github.com/Parquet/parquet-format/blob/master/Encodings.md and assumed 
it would be less, so FIXED_LEN_BYTE_ARRAY should be good for now (until s.o. 
complains).
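Where those percentages come from, roughly (the plain BINARY encoding carries a 4-byte length per value, the fixed-length encoding does not):

```scala
// Pre-compression length-prefix overhead of BINARY vs FIXED_LEN_BYTE_ARRAY.
val bytesForDecimal40  = 17           // minimal two's complement width for 40 digits
val bytesForDecimal100 = 42           // minimal width for 100 digits
val overhead40  = 4.0 / bytesForDecimal40    // ~0.24 -> "<25%"
val overhead100 = 4.0 / bytesForDecimal100   // ~0.10 -> "<10%"
```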





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-24 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r33133591
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
 case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
 case TimestampType => 
Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
 case DecimalType.Fixed(precision, scale) if precision <= 18 =>
-  // TODO: for now, our writer only supports decimals that fit in a 
Long
   Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
 Some(ParquetOriginalType.DECIMAL),
 Some(new DecimalMetadata(precision, scale)),
 Some(BYTES_FOR_PRECISION(precision
+case DecimalType.Fixed(precision, scale) =>
+  Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY,
--- End diff --

Under the assumption that all values use the full length, yes.

But at some point the length-prefix overhead is small compared to the waste 
when someone only specifies an upper bound for the values.
I have to check whether BINARY really uses 4 bytes for the length; I'd then 
raise the threshold to ~40 bytes (meaning <=10% worst-case overhead before 
compression).

It won't simplify the decoding/writing though, because the <=18 case is 
used for long decoding.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-24 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r33129732
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala 
---
@@ -331,7 +331,7 @@ private[sql] class JDBCRDD(
   case BooleanType => BooleanConversion
   case DateType => DateConversion
   case DecimalType.Unlimited => DecimalConversion(None)
-  case DecimalType.Fixed(d) => DecimalConversion(Some(d))
+  case DecimalType.Fixed(d, s) => DecimalConversion(Some((d, s)))
--- End diff --

As said, it was only about a warning, not about correctness. I'll drop this 
change in the next version; it draws too much attention and is not needed.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-24 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r33124241
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala 
---
@@ -369,9 +371,6 @@ private[parquet] class MutableRowWriteSupport extends 
RowWriteSupport {
   case DateType => writer.addInteger(record.getInt(index))
   case TimestampType => writeTimestamp(record.getLong(index))
   case d: DecimalType =>
-if (d.precisionInfo == None || d.precisionInfo.get.precision > 18) 
{
-  sys.error(s"Unsupported datatype $d, cannot write to consumer")
-}
--- End diff --

Overlooked, that's a bug. We can't serialize without that info, Parquet requires it 
(and mixed scale would be complicated).

PS: do you know if there is any interest in allowing mixed Decimal in 
Parquet?

On 24 June 2015, 09:36 CEST, Cheng Lian wrote:
>> @@ -369,9 +371,6 @@ private[parquet] class MutableRowWriteSupport
>extends RowWriteSupport {
>>case DateType => writer.addInteger(record.getInt(index))
>>case TimestampType => writeTimestamp(record.getLong(index))
>>case d: DecimalType =>
>> -if (d.precisionInfo == None || d.precisionInfo.get.precision
>> 18) {
>> -  sys.error(s"Unsupported datatype $d, cannot write to
>consumer")
>> -}
>
>Don't we need to consider the case where `d.precisionInfo == None` now?
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/6796/files#r33123740




[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-21 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r32892827
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -289,7 +295,20 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
   name: String,
   nullable: Boolean = true,
   inArray: Boolean = false,
+  parquetSchema: Option[ParquetType] = None,
   toThriftSchemaNames: Boolean = false): ParquetType = {
+
+val parquetElementTypeBySchema =
--- End diff --

It also performs a type check / conversion. That's why I've removed it. It 
would look like this

```
val parquetElementTypeBySchema =
  parquetSchema
    .filter(_.isInstanceOf[ParquetGroupType])
    .filter(_.containsField(name))
    .map(_.getType(name))
```

I would settle on collect, does that look ok?

```
  val parquetElementTypeBySchema = parquetSchema.collect {
case gType : ParquetGroupType if (gType.containsField(name)) => 
gType.getType(name)
  }
```





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-21 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r32891515
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
 case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
 case TimestampType => 
Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
 case DecimalType.Fixed(precision, scale) if precision <= 18 =>
-  // TODO: for now, our writer only supports decimals that fit in a 
Long
   Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
--- End diff --

What would we gain by encoding it that way?
We use a minimal-length fixed byte array, which should provide a similarly 
compact encoding (DECIMAL(9) should end up as 4 bytes, and smaller decimals 
should take even less space).

Decoding is a different story, though.

PS: I was focusing on DECIMAL with precision >=19. Shouldn't small decimal 
handling be a new ticket?

On 21 June 2015, 02:41 CEST, Davies Liu wrote:
>> @@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter
>extends Logging {
>>  case LongType =>
>Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
>>  case TimestampType =>
>Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
>>  case DecimalType.Fixed(precision, scale) if precision <= 18 =>
>> -  // TODO: for now, our writer only supports decimals that fit
>in a Long
>>   
>Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
>
>We should int32, int64 if possible, see

>https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>```
>DECIMAL can be used to annotate the following types:
>
>int32: for 1 <= precision <= 9
>int64: for 1 <= precision <= 18; precision <= 10 will produce a warning
>fixed_len_byte_array: precision is limited by the array size. Length n
>can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
>binary: precision is not limited, but is required. The minimum number
>of bytes to store the unscaled value should be used.
>```
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/6796/files#r32889441




[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-21 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r32891379
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -289,7 +295,20 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
   name: String,
   nullable: Boolean = true,
   inArray: Boolean = false,
+  parquetSchema: Option[ParquetType] = None,
   toThriftSchemaNames: Boolean = false): ParquetType = {
+
+val parquetElementTypeBySchema =
--- End diff --

Ah, yes, an early version had that, I somehow moved to this verbose code O.o

Thanks

On 21 June 2015, 02:33 CEST, Davies Liu wrote:
>>toThriftSchemaNames: Boolean = false): ParquetType = {
>> +
>> +val parquetElementTypeBySchema =
>
>These could be
>```
>parquetSchema.filter(_.containsField(name)).map(_.getType(name))
>```
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/6796/files#r32889401




[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-20 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113834732
  
I've pushed a very early version of a fix. (Literally early, it's nearly 
1:00 am. And I'd expect the test build to fail, I'll fix the outstanding issues 
later today)

PS: Loading the hive parquet works now, but I've not yet tested much more.





[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...

2015-06-20 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113814084
  
The problematic line is

https://github.com/apache/spark/blob/v1.4.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L105
(calling ParquetTypesConverter.convertFromAttributes)
This causes the schema to be normalized, but that does not work for decimal.

```
message hive_schema {
  optional int32 id;
  optional fixed_len_byte_array(13) value (DECIMAL(30,0));
}
```
is replaced by
```
message root {
  optional int32 id;
  optional binary value (DECIMAL(30,0));
}
```

I removed that line and the loading of the data works.





[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...

2015-06-20 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113762279
  
Ok, I think I'm slowly getting down to the cause: strictTypeChecking on 
https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/ColumnIOFactory.java#L100
 creates the exception.

The relevant class for the job setup is ParquetTableScan (doExecute).

I'm not yet sure if this can be fixed on the job setup or on the receiver 
side.





[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...

2015-06-19 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113629961
  
I've pushed the hive generated parquet file and I'll call it a day.

I think I'll have to relax the validation of column types for DECIMAL.





[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...

2015-06-19 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113628860
  
Ok, it looks like I can't open hive-generated parquet files, but it seems to be 
more of a type error.

```
scala> val hive = sqlContext.load("/home/rtreffer/work/hadoop/hive-parquet")
warning: there was one deprecation warning; re-run with -deprecation for 
details
15/06/19 22:03:26 INFO ParquetFileReader: Initiating action with 
parallelism: 5
hive: org.apache.spark.sql.DataFrame = [id: int, value: decimal(30,0)]

scala> hive.collect.foreach(println)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 
localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_9_piece0 on 
localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(130208) called with 
curMem=0, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10 stored as values in 
memory (estimated size 127.2 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(14082) called with 
curMem=130208, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10_piece0 stored as 
bytes in memory (estimated size 13.8 KB, free 265.0 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Added broadcast_10_piece0 in 
memory on localhost:42189 (size: 13.8 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO SparkContext: Created broadcast 10 from collect at 
:38
15/06/19 22:03:35 INFO SparkContext: Starting job: collect at :38
15/06/19 22:03:35 INFO DAGScheduler: Got job 9 (collect at :38) 
with 1 output partitions (allowLocal=false)
15/06/19 22:03:35 INFO DAGScheduler: Final stage: ResultStage 11(collect at 
:38)
15/06/19 22:03:35 INFO DAGScheduler: Parents of final stage: List()
15/06/19 22:03:35 INFO DAGScheduler: Missing parents: List()
15/06/19 22:03:35 INFO DAGScheduler: Submitting ResultStage 11 
(MapPartitionsRDD[45] at collect at :38), which has no missing parents
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(5568) called with 
curMem=144290, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11 stored as values in 
memory (estimated size 5.4 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(2964) called with 
curMem=149858, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11_piece0 stored as 
bytes in memory (estimated size 2.9 KB, free 265.0 MB)
15/06/19 22:03:36 INFO BlockManagerInfo: Added broadcast_11_piece0 in 
memory on localhost:42189 (size: 2.9 KB, free: 265.1 MB)
15/06/19 22:03:36 INFO SparkContext: Created broadcast 11 from broadcast at 
DAGScheduler.scala:893
15/06/19 22:03:36 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 11 (MapPartitionsRDD[45] at collect at :38)
15/06/19 22:03:36 INFO TaskSchedulerImpl: Adding task set 11.0 with 1 tasks
15/06/19 22:03:36 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 
36, localhost, PROCESS_LOCAL, 1471 bytes)
15/06/19 22:03:36 INFO Executor: Running task 0.0 in stage 11.0 (TID 36)
15/06/19 22:03:36 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1: Input 
split: ParquetInputSplit{part: 
file:/home/rtreffer/work/hadoop/hive-parquet/00_0 start: 0 end: 874 length: 
874 hosts: []}
15/06/19 22:03:36 INFO InternalParquetRecordReader: RecordReader 
initialized will read a total of 33 records.
15/06/19 22:03:36 INFO InternalParquetRecordReader: at row 0. reading next 
block
15/06/19 22:03:36 INFO InternalParquetRecordReader: block read in memory in 
0 ms. row count = 33
15/06/19 22:03:36 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 
36)
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file file:/home/rtreffer/work/hadoop/hive-parquet/00_0
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.sca
```

[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...

2015-06-19 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113622541
  
(hive 1.2.0 and hadoop 2.7.0 without hdfs or a cluster)





[GitHub] spark pull request: [SPARK-4176] Support decimal types with precis...

2015-06-19 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113621866
  
Just did a test with hive, I can declare a parquet file written with spark 
as

```
CREATE EXTERNAL TABLE ptest (id INT, value DECIMAL(30,0))
STORED AS PARQUET
LOCATION "file:///home/rtreffer/work/hadoop/parquet";
```

and it *does* work. I'm now trying to test the opposite direction plus a 
test case.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-18 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113277495
  
Yes, manually. I could add the file I was writing afterwards, sounds like a 
good idea.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-18 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113276153
  
I was thinking about this: Create a small parquet file with spark, load it 
with hive, copy it to a new parquet table with hive, read that with spark.

If that matches the input -> win. Otherwise -> some more work.

PS: SPARK-8342 / SPARK-8359 were only problems during my initial tests. 
It's a bit harder to test read/write of Decimal if the implementation has bugs. 
So those are unrelated to this patch, but they might reduce the usefulness of 
this patch (you can't do reliable math in the ranges you could now load/save)





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-18 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-113274389
  
The only reason for the WIP is that I have not yet cross-tested the 
interoperability with e.g. hive. It follows the spec, but I'd like to test it 
(or have s.o. verify this).





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-13 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/6796#discussion_r32371677
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala 
---
@@ -331,7 +331,7 @@ private[sql] class JDBCRDD(
   case BooleanType => BooleanConversion
   case DateType => DateConversion
   case DecimalType.Unlimited => DecimalConversion(None)
-  case DecimalType.Fixed(d) => DecimalConversion(Some(d))
+  case DecimalType.Fixed(d, s) => DecimalConversion(Some((d, s)))
--- End diff --

It caused a warning on my system. So I thought it would be better to make 
it explicit.

I could drop that line from this patch, though.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-13 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6796#issuecomment-111713135
  
Note: I came across https://issues.apache.org/jira/browse/SPARK-8342 while 
testing, it seems like Decimal math is unsafe at the moment.





[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...

2015-06-13 Thread rtreffer
GitHub user rtreffer opened a pull request:

https://github.com/apache/spark/pull/6796

[SPARK-4176][WIP] Support decimal types with precision > 18 in parquet

This is my current WIP on SPARK-4176. It should be compatible with other 
implementations of parquet.


https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal
```
For byte arrays, binary and fixed, the unscaled number must be encoded as
two's complement using big-endian byte order
(the most significant byte is the zeroth element)
```
This is the default encoding of BigInteger (toByteArray). It should thus be 
compatible with other implementations, although it would be great if s.o. could test this.
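A minimal sketch of that encoding (BigInteger#toByteArray already yields big-endian two's complement; it only needs sign-extending left padding to the column's fixed length; `toFixedLength` here is just an illustration):

```scala
import java.math.BigInteger

// Pad a big-endian two's complement value to the fixed length of the column.
def toFixedLength(unscaled: BigInteger, numBytes: Int): Array[Byte] = {
  val bytes = unscaled.toByteArray                    // big-endian two's complement
  require(bytes.length <= numBytes, "value does not fit the declared precision")
  val pad: Byte = (if (unscaled.signum < 0) -1 else 0).toByte
  val out = Array.fill[Byte](numBytes)(pad)           // sign-extend
  System.arraycopy(bytes, 0, out, numBytes - bytes.length, bytes.length)
  out
}

// e.g. 2^70 stored as DECIMAL(22,0) needs a 10-byte fixed_len_byte_array
toFixedLength(BigInteger.ONE.shiftLeft(70), 10)
```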

I've tested this locally with powers of 2 up to 2^200 in the spark shell, 
without errors.

Code I've used for (local) testing (on spark shell):
```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.io.File

def decimalList(n : Int) = {
  require(n > 0)
  val one = Decimal(1)
  val two = Decimal(2)
  one.changePrecision(n,0)
  two.changePrecision(n,0)
  val list = scala.collection.mutable.ArrayBuffer(one)
  while (list.length < n) {
val v = list(list.length - 1)
list += Decimal(v.toJavaBigDecimal.multiply(two.toJavaBigDecimal),n,0)
  }
  list.toList
}

def decimalRows(l : List[Decimal]) = l.zipWithIndex.map(e => Row(e._2,e._1))

def decimalRowRdd(l : List[Decimal]) = sc.parallelize(decimalRows(l))

def df(n : Int) = {
  val data = decimalList(n)
  val len = data.lastOption.get.toString.length
  val schema = StructType(Array(
StructField("id", IntegerType, true),
StructField("value", DecimalType(len,0), true)))
  sqlContext.createDataFrame(decimalRowRdd(data), schema)
}

def delete(filename : String) : Unit = {
  val f = new File(filename)
  if (!f.exists) return
  if (f.isDirectory) f.listFiles.foreach(_.delete)
  f.delete
}

def test(n : Int) = {
  val src = df(n)
  delete("/tmp/typetest")
  src.save("/tmp/typetest")
  val copy = sqlContext.load("/tmp/typetest")
  
src.collect().sortBy(_.getInt(0)).zip(copy.collect().sortBy(_.getInt(0))).foreach(e
 => {
if (e._1 != e._2) println(s"${e._1} != ${e._2}")
  })
  delete("/tmp/typetest")
  println(s"Tested 1..2^${n - 1}")
}
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rtreffer/spark 
spark-4176-store-large-decimal-in-parquet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6796.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6796


commit 32630dfa440f618021d5d0a1db86cb7a3ea5e2f4
Author: Rene Treffer 
Date:   2015-06-13T12:54:42Z

[SPARK-4176] Support decimal types with precision > 18







[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...

2015-06-13 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6789#issuecomment-111690002
  
Oh, I'm really lucky with Jenkins, it seems it got a network error (or a 
GitHub quota limit or the like):

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34820/console

Rebased on master to trigger another build.





[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...

2015-06-12 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/6789#issuecomment-111594673
  
@viirya pushed the original patch, any objections?





[GitHub] spark pull request: [SPARK-7897] Improve type for jdbc/"unsigned b...

2015-06-12 Thread rtreffer
GitHub user rtreffer opened a pull request:

https://github.com/apache/spark/pull/6789

[SPARK-7897] Improve type for jdbc/"unsigned bigint"

The original fix uses DecimalType.Unlimited, which is harder to
handle afterwards. There is no scale and most data should fit into
a long, thus DecimalType(20,0) should be better.
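A quick check of the bound behind DECIMAL(20,0):

```scala
// The largest unsigned BIGINT is 2^64 - 1, which has 20 decimal digits and no
// scale, so DECIMAL(20,0) covers the whole range.
val maxUnsignedBigint = BigInt(2).pow(64) - 1   // 18446744073709551615
maxUnsignedBigint.toString.length               // 20
```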

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rtreffer/spark 
spark-7897-unsigned-bigint-as-decimal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6789.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6789


commit e99052c5b8866c1f0ec98309c8bc249139552f5a
Author: Rene Treffer 
Date:   2015-06-12T07:39:42Z

Fix type for "unsigned bigint" jdbc loading.

The original fix uses DecimalType.Unlimited, which is harder to
handle afterwards. There is no scale and most data should fit into
a long, thus DecimalType(20,0) should be better.







[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-05-18 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-102951240
  
@rxin thank you, found the MimaExcludes, let's see if I've done that 
correctly.





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-05-17 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-102866270
  
Jenkins fails with

```
[info] spark-sql: found 5 potential binary incompatibilities (filtered 287)
[error]  * object org.apache.spark.sql.jdbc.DriverQuirks does not have a 
correspondent in new version
[error]filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks$")
[error]  * class org.apache.spark.sql.jdbc.DriverQuirks does not have a 
correspondent in new version
[error]filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks")
[error]  * class org.apache.spark.sql.jdbc.PostgresQuirks does not have a 
correspondent in new version
[error]filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.PostgresQuirks")
[error]  * class org.apache.spark.sql.jdbc.NoQuirks does not have a 
correspondent in new version
[error]filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.NoQuirks")
[error]  * class org.apache.spark.sql.jdbc.MySQLQuirks does not have a 
correspondent in new version
[error]filter with: 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.MySQLQuirks")
```
but that was the intention of the rename.
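
For reference, such intentional breaks are normally acknowledged by adding the suggested filters to project/MimaExcludes.scala. A minimal sketch, with the filter strings taken verbatim from the error output above (the val name and its exact placement inside MimaExcludes are illustrative):

```
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

// Tell MiMa that the DriverQuirks classes disappeared on purpose as part of
// the rename to JdbcDialects.
val jdbcDialectRenameExcludes = Seq(
  exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks"),
  exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.DriverQuirks$"),
  exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.NoQuirks"),
  exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.PostgresQuirks"),
  exclude[MissingClassProblem]("org.apache.spark.sql.jdbc.MySQLQuirks")
)
```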





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-05-17 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-102865899
  
Rebased on master, squashed, added a case class for the JdbcType and got 
rid of null.





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-05-16 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-102583414
  
It's on my todo list for this weekend.

On 16 May 2015 at 08:15:38 MESZ, Reynold Xin wrote:
    >@rtreffer do you have time to update this?
>
>---
>Reply to this email directly or view it on GitHub:
>https://github.com/apache/spark/pull/#issuecomment-102576337

-- 
This message was sent from my Android mobile phone with K-9 Mail.





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-05-07 Thread rtreffer
Github user rtreffer commented on a diff in the pull request:

https://github.com/apache/spark/pull/#discussion_r29850737
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala ---
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import org.apache.spark.sql.types._
+import org.apache.spark.annotation.DeveloperApi
+
+import java.sql.Types
+
+
+/**
+ * :: DeveloperApi ::
+ * Encapsulates everything (extensions, workarounds, quirks) to handle the
+ * SQL dialect of a certain database or jdbc driver.
+ * Lots of databases define types that aren't explicitly supported
+ * by the JDBC spec.  Some JDBC drivers also report inaccurate
+ * information---for instance, BIT(n>1) being reported as a BIT type is 
quite
+ * common, even though BIT in JDBC is meant for single-bit values.  Also, 
there
+ * does not appear to be a standard name for an unbounded string or binary
+ * type; we use BLOB and CLOB by default but override with 
database-specific
+ * alternatives when these are absent or do not behave correctly.
+ *
+ * Currently, the only thing done by the dialect is type mapping.
+ * `getCatalystType` is used when reading from a JDBC table and 
`getJDBCType`
+ * is used when writing to a JDBC table.  If `getCatalystType` returns 
`null`,
+ * the default type handling is used for the given JDBC type.  Similarly,
+ * if `getJDBCType` returns `(null, None)`, the default type handling is 
used
+ * for the given Catalyst type.
+ */
+@DeveloperApi
+abstract class JdbcDialect {
+  /**
+   * Check if this dialect instance can handle a certain jdbc url.
+   * @param url the jdbc url.
+   * @return True if the dialect can be applied on the given jdbc url.
+   * @throws NullPointerException if the url is null.
+   */
+  def canHandle(url : String): Boolean
+
+  /**
+   * Get the custom datatype mapping for the given jdbc meta information.
+   * @param sqlType The sql type (see java.sql.Types)
+   * @param typeName The sql type name (e.g. "BIGINT UNSIGNED")
+   * @param size The size of the type.
+   * @param md Result metadata associated with this type.
+   * @return The actual DataType (subclasses of 
[[org.apache.spark.sql.types.DataType]])
+   * or null if the default type mapping should be used.
+   */
+  def getCatalystType(sqlType: Int, typeName: String, size: Int, md: 
MetadataBuilder): DataType =
+null
+
+  /**
+   * Retrieve the jdbc / sql type for a give datatype.
+   * @param dt The datatype (e.g. 
[[org.apache.spark.sql.types.StringType]])
+   * @return A tuple of sql type name and sql type, or {{{(null, None)}}} 
for no change.
+   */
+  def getJDBCType(dt: DataType): (String, Option[Int]) = (null, None)
--- End diff --

Sounds like a good idea. It would also clarify the magic 
getJDBCType(...)._1 and getJDBCType(...)._2 calls in jdbc.scala.
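
A minimal sketch of what wrapping the tuple in a small case class could look like (the names below are illustrative, not necessarily what was merged):

```
import org.apache.spark.sql.types.DataType

// Named fields instead of the anonymous (String, Option[Int]) tuple.
case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)

abstract class JdbcDialectSketch {
  // None replaces the old (null, None) convention and simply means
  // "no override, use the default type handling"; call sites can then read
  // .databaseTypeDefinition instead of the opaque ._1 / ._2.
  def getJDBCType(dt: DataType): Option[JdbcType] = None
}
```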





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-05-06 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-99427524
  
All the feedback regarding this has been addressed (hopefully) and the tests are 
green. What should I do next?





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-04-30 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-97702611
  
Still no build :cry: 





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-04-27 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-96573899
  
@marmbrus what should we do now? New PR?





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-04-25 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-96204329
  
@marmbrus I've fixed the line breaks, but Jenkins does not seem to pick up 
the changes and rerun the tests :-( What should I do?





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-04-24 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-96058998
  
Thank you for the fast feedback, I'm quite impressed by the feedback 
speed and quality on the Spark project.

I was traveling last week, so the fix was a bit delayed.





[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks

2015-04-17 Thread rtreffer
Github user rtreffer closed the pull request at:

https://github.com/apache/spark/pull/5498





[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks

2015-04-17 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/5498#issuecomment-93963311
  
Replaced by https://github.com/apache/spark/pull/





[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...

2015-04-17 Thread rtreffer
GitHub user rtreffer opened a pull request:

https://github.com/apache/spark/pull/

[SPARK-6888][SQL] Make the jdbc driver handling user-definable

Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
and allow developers to change the dialects on the fly (for new JDBCRDDs only).

Some types (like an unsigned 64-bit number) can be trivially mapped to Java.
The status quo is that the RDD will fail to load.
This patch makes it possible to override the type mapping to read e.g.
64-bit numbers as strings and handle them afterwards in software.

JDBCSuite has an example that maps all types to String, which should always
work (at the cost of extra code afterwards).

As a side effect it should now be possible to develop simple dialects
out-of-tree and even with spark-shell.
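
As a hedged sketch of that escape hatch, written against the JdbcDialect draft from this PR (returning a non-null DataType overrides the default mapping); the dialect name and the registration entry point are assumptions here, see JDBCSuite for the authoritative example:

```
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Read every column as a plain string so that types the default mapping
// cannot handle (e.g. an unsigned 64-bit counter) still load; the values can
// then be post-processed in software.
object EverythingIsStringDialect extends JdbcDialect {
  def canHandle(url: String): Boolean = url.startsWith("jdbc:")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType = StringType
}

// Only JDBCRDDs created after this call pick the dialect up.
JdbcDialects.registerDialect(EverythingIsStringDialect)
```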

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rtreffer/spark jdbc-dialects

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #


commit 3e98f746a90a4764647fb9f4b17a416cab3f868a
Author: Rene Treffer 
Date:   2015-04-17T09:06:55Z

[SPARK-6888] Make the jdbc driver handling user-definable

Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
and allow developers to change the dialects on the fly (for new JDBCRDDs only).

Some types (like an unsigned 64-bit number) can be trivially mapped to Java.
The status quo is that the RDD will fail to load.
This patch makes it possible to override the type mapping to read e.g.
64-bit numbers as strings and handle them afterwards in software.

JDBCSuite has an example that maps all types to String, which should always
work (at the cost of extra code afterwards).

As a side effect it should now be possible to develop simple dialects
out-of-tree and even with spark-shell.







[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks

2015-04-16 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/5498#issuecomment-93677337
  
@marmbrus thank you, I'll fix those issues and open a new one when done.

Regarding naming/api:

It is quite common to have one class per SQL/JDBC dialect, often named 
exactly that way (e.g. MySQLDialect in Hibernate). I've found quite a few 
projects that use the same naming (via a GitHub search).
Anyway, in this case I'd like to match that naming and add a default 
implementation per method (returning the neutral element).

On the other hand, it's currently only doing type mapping, so 
JDBCTypeMapping would be a perfectly valid name too. It would restrict the 
use case more (which can be good or bad).

I guess you know better what would suit Spark :-)





[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks

2015-04-15 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/5498#issuecomment-93571593
  
I still have to write tests for AggregatedQuirks.





[GitHub] spark pull request: [SPARK-6888][SQL] Export driver quirks

2015-04-15 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/5498#issuecomment-93272238
  
@liancheng thank you, I will update the patch.
Just one question: should I squash/amend the fixes or should I add a second 
commit?





[GitHub] spark pull request: Export driver quirks

2015-04-13 Thread rtreffer
Github user rtreffer commented on the pull request:

https://github.com/apache/spark/pull/5498#issuecomment-92521352
  
Added a ticket: https://issues.apache.org/jira/browse/SPARK-6888
Will add it to the commit message after some sleep .zZzZzZ





[GitHub] spark pull request: Export driver quirks

2015-04-13 Thread rtreffer
GitHub user rtreffer opened a pull request:

https://github.com/apache/spark/pull/5498

Export driver quirks

Make it possible to (temporarily) overwrite the driver quirks. This
can be used to overcome problems with specific schemas or to
add new JDBC driver support on the fly.

A very simple implementation that dumps the type information seen while 
loading can be written like this (in spark-shell):
```
class DumpQuirk extends org.apache.spark.sql.jdbc.DriverQuirks {
  // Handle every jdbc url.
  def canHandle(url : String): Boolean = true

  // Print whatever the driver reports for each column, then return null so
  // the default type mapping still applies.
  def getCatalystType(sqlType: Int, typeName: String, size: Int,
      md: org.apache.spark.sql.types.MetadataBuilder): org.apache.spark.sql.types.DataType = {
    println("" + (sqlType, typeName, size, md))
    null
  }

  // (null, None) means "no change" on the write path.
  def getJDBCType(dt: org.apache.spark.sql.types.DataType): (String, Option[Int]) =
    (null, None)
}
org.apache.spark.sql.jdbc.DriverQuirks.registerQuirks(new DumpQuirk())
```
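
With the quirk registered, any subsequent JDBC load in the same spark-shell session should print one line per column. A sketch under the 1.3-era API (the connection URL and table name are placeholders):

```
// DumpQuirk.getCatalystType echoes (sqlType, typeName, size, md) for every
// column and then falls back to the default mapping.
val df = sqlContext.jdbc("jdbc:mysql://localhost/test?user=spark", "some_table")
df.printSchema()
```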

Note that this pull request is against 1.3 - I could not create a 
distribution with the current master.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rtreffer/spark export-driver-quirks

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5498


commit dca9372fc32e3222afc25e11394b1e620c115efa
Author: Rene Treffer 
Date:   2015-04-13T21:38:59Z

Export driver quirks

Make it possible to (temporarily) overwrite the driver quirks. This
can be used to overcome problems with specific schemas or to
add new JDBC driver support on the fly.



