[ 
https://issues.apache.org/jira/browse/DRILL-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080834#comment-16080834
 ] 

Paul Rogers edited comment on DRILL-4139 at 7/10/17 8:00 PM:
-------------------------------------------------------------

The suggestion to use Base64 encoding seems to be the right solution.

Bytes are not characers. Bytes span the full range from 0-255. Since our JSON 
is Unicode, encoded as UTF-8, some combination of bytes will be interpreted as 
multi-byte characters in Unicode. That is, we are abusing the software.

Base64 has the advantage that it is designed to be broken into lines, which can 
be encoded in JSON as an array.

Note that, for encoding purposes, we only care about the byte order in the 
buffer: left-to-right. The meaning of those bytes is unimportant for 
serialization. That is, whether the data is big-endian, little-endian, 
stream-of-bytes, or stream-of-multi-byte characters is important to the code 
that interprets the (decoded) bytes, but not to the serialization format.



was (Author: paul-rogers):
First impression is that we are serializing bytes all wrong. Bytes are not 
characers. Bytes span the full range from 0-255. Since our JSON is Unicode, 
encoded as UTF-8, some combination of bytes will be interpreted as multi-byte 
characters in Unicode. That is, we are abusing the software.

Correct format for bytes is using a binary format. Most basic:

{code}
'000AFF132D'
{code}

We interpret the above as two hex digits per byte, in left-to-right (lowest to 
highest) address order.

The Internet has a number of ways to store binary data in a more compact form. 
Base64 (RFC-4648) is popular and has built-in support in Java (the 
{{Base64.Encoder}}) class. For example, here is is an example Base64 string:

{code}
'Vm9sb2R5bXlyIFZ5c290c2t5aQ=='
{code}

Base64 has the advantage that it is designed to be broken into lines, which can 
be encoded in JSON as an array.

Note that, for encoding purposes, we only care about the byte order in the 
buffer: left-to-right. The meaning of those bytes is unimportant for 
serialization. That is, whether the data is big-endian, little-endian, 
stream-of-bytes, or stream-of-multi-byte characters is important to the code 
that interprets the (decoded) bytes, but not to the serialization format.


> Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
> -----------------------------------------------------------------
>
>                 Key: DRILL-4139
>                 URL: https://issues.apache.org/jira/browse/DRILL-4139
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.3.0
>         Environment: 4 node cluster on CentOS
>            Reporter: Khurram Faraaz
>            Assignee: Volodymyr Vysotskyi
>         Attachments: metadata file v3, metadata file with changes
>
>
> Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> is seen in drillbit.log after Functional run on 4 node cluster.
> Drill 1.3.0 sys.version => d61bb83a8
> {code}
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  
> o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning 
> class: org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  
> o.a.d.e.p.l.partition.PruneScanRule - Total elapsed time to build and analyze 
> filter tree: 0 ms
> 2015-11-27 03:12:19,810 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] WARN  
> o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune 
> partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
>         at 
> org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:479)
>  ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96)
>  ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:235)
>  ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87)
>  [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228)
>  [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at 
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808)
>  [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at 
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) 
> [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at 
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:303) 
> [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.logicalPlanningVolcanoAndLopt(DefaultSqlHandler.java:545)
>  [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:213)
>  [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:248)
>  [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:164)
>  [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:184)
>  [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:905) 
> [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:244) 
> [drill-java-exec-1.3.0.jar:1.3.0]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_45]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_45]
>         at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to