[
https://issues.apache.org/jira/browse/DRILL-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802012#comment-17802012
]
ASF GitHub Bot commented on DRILL-2835:
---------------------------------------
paul-rogers commented on PR #2836:
URL: https://github.com/apache/drill/pull/2836#issuecomment-1874845274
Hi Mike,
Just jumping in with a random thought. Drill has accumulated a number of
schema systems: Parquet metadata cache, HMS, Drill's own metastore,
"provided schema", and now DFDL. All provide ways of defining data: be it
Parquet, JSON, CSV or whatever. One can't help but wonder, should some
future version try to reduce this variation somewhat? Maybe map all the
variations to DFDL? Map DFDL to Drill's own mechanisms?
Drill uses two kinds of metadata: schema definitions and file metadata used
for scan pruning. Schema information can be used at plan time (to provide
column types) and is certainly used at scan time (to "discover" the defined
schema). File metadata is used primarily at plan time to work out how to
distribute work.
A bit of background on scan pruning. Back in the day, it was common to have
thousands or millions of files in Hadoop to scan: this was why tools like
Drill were distributed: divide and conquer. And, of course, the fastest
scan is to skip files that we know can't contain the information we want.
File metadata captures this information outside of the files themselves.
HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is
evidently based on HMS.)
For example, Drill's Parquet metadata cache, the Drill metastore and HMS
all provide both schema and file metadata information. The schema
information mainly helped with schema evolution: over time, different files
have different sets of columns. File metadata provides information *about*
the file, such as the data ranges stored in each file. For Parquet, we
might track that '2023-01-Boston.parquet' has data from the office='Boston'
range. (So, no use scanning the file for office='Austin'.) And so on.
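To make the Parquet case concrete, here is a rough sketch in Drill SQL; the
/sales path and the office column are invented for illustration:
{code}
-- Build (or rebuild) Drill's Parquet metadata cache for a directory tree.
-- The cache records per-file statistics, including column min/max values.
REFRESH TABLE METADATA dfs.`/sales`;

-- At plan time the cached ranges let Drill skip files whose office values
-- cannot include 'Austin', e.g. 2023-01-Boston.parquet.
SELECT * FROM dfs.`/sales` WHERE office = 'Austin';
{code}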
With Hadoop HDFS, it was customary to use directory structure as a partial
primary index: our file above would live in the /sales/2023/01 directory,
for example, and logic chooses the proper set of directories to scan. In
Drill, it is up to the user to add crufty conditionals on the path name. In
Impala, and other HMS-aware tools, the user just says WHERE order_year =
2023 AND order_month = 1, and HMS tells the tool that the order_year and
order_month columns translate to such-and-so directory paths. Would be nice
if Drill could provide that feature as well, given the proper file
metadata: in this case, the mapping of column names to path directories and
file names.
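A hedged sketch of the difference, using made-up paths and column names
(Drill's implicit dir0/dir1 columns are real; the sales table and its
partition columns are hypothetical):
{code}
-- Drill today: the user encodes the directory layout by hand, using the
-- implicit dir0/dir1 columns (/sales/2023/01 -> dir0 = '2023', dir1 = '01').
SELECT * FROM dfs.`/sales` WHERE dir0 = '2023' AND dir1 = '01';

-- HMS-aware tools (Impala, Hive): the catalog maps partition columns to
-- directories, so the logical condition alone drives directory pruning.
SELECT * FROM sales WHERE order_year = 2023 AND order_month = 1;
{code}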
Does DFDL provide only schema information? Does it support versioning so
that we know that "old.csv" lacks the "version" column, while "new.csv"
includes that column? Does it also include the kinds of file metadata
mentioned above?
Or, perhaps DFDL is used in a different context in which the files have a
fixed schema and are small in number? This would fit well with the "desktop
analytics" model that Charles and James suggested is where Drill is now
most commonly used.
The answers might suggest whether DFDL can be the universal data description,
or whether it applies just to individual file schemas, in which case Drill
would still need a second system to track schema evolution and file metadata
for large deployments.
Further, if DFDL is kind of a stand-alone thing, with its own reader, then
we end up with more complexity: the Drill JSON reader and the DFDL JSON
reader. Same for CSV, etc. JSON is so complex that we'd find ourselves
telling people that the quirks work one way with the native reader, another
way with DFDL. Plus, the DFDL readers might not handle file splits the same
way, or support the same set of formats that Drill's other readers support,
and so on. It would be nice to separate the idea of schema description from
reader implementation, so that DFDL can be used as a source of schema for
any arbitrary reader: both at plan and scan times.
If DFDL uses its own readers, then we'd need DFDL reader representations in
Calcite, which would pick up DFDL schemas so that the schemas are reliably
serialized out to each node as part of the physical plan. This is possible,
but it does send us down the two-readers-for-every-format path.
On the other hand, if DFDL mapped to Drill's existing schema description,
then DFDL could be used with our existing readers and there would be just
one schema description sent to readers: Drill's existing provided schema
format that EVF can already consume. At present, just a few formats support
provided schema in the Calcite layer: CSV for sure, maybe JSON?
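For reference, the existing provided-schema mechanism looks roughly like this
(the table and column names are invented); the idea would be for DFDL to
compile down to this form so that any reader, not just a DFDL-specific one,
could consume it:
{code}
-- Attach a provided schema to a directory of CSV files. EVF-based readers
-- (text/CSV today) pick this up at scan time to type the columns.
CREATE OR REPLACE SCHEMA (
  order_id INT NOT NULL,
  office VARCHAR,
  amount DOUBLE
)
FOR TABLE dfs.tmp.`sales_csv`;
{code}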
Any thoughts on where this kind of thing might evolve with DFDL in the
picture?
Thanks,
- Paul
On Tue, Jan 2, 2024 at 8:00 AM Mike Beckerle ***@***.***>
wrote:
> @cgivre <https://github.com/cgivre> yes, the next architectural-level
> issue is how to get a compiled DFDL schema out to everyplace Drill will run
> a Daffodil parse. Every one of those JVMs needs to reload it.
>
> I'll do the various cleanups and such. The one issue I don't know how to
> fix is the "typed setter" vs. (set-object) issue, so if you could steer me
> in the right direction on that it would help.
>
> IndexOutOfBoundsException in partition sender when doing streaming aggregate
> with LIMIT
> ----------------------------------------------------------------------------------------
>
> Key: DRILL-2835
> URL: https://issues.apache.org/jira/browse/DRILL-2835
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - RPC
> Affects Versions: 0.8.0
> Reporter: Aman Sinha
> Assignee: Venki Korukanti
> Priority: Major
> Fix For: 0.9.0
>
> Attachments: DRILL-2835-1.patch, DRILL-2835-2.patch
>
>
> Following CTAS run on a TPC-DS 100GB scale factor on a 10-node cluster:
> {code}
> alter session set `planner.enable_hashagg` = false;
> alter session set `planner.enable_multiphase_agg` = true;
> create table dfs.tmp.stream9 as
> select cr_call_center_sk , cr_catalog_page_sk , cr_item_sk , cr_reason_sk ,
> cr_refunded_addr_sk , count(*) from catalog_returns_dri100
> group by cr_call_center_sk , cr_catalog_page_sk , cr_item_sk , cr_reason_sk
> , cr_refunded_addr_sk
> limit 100
> ;
> {code}
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: index: 1023, length: 1 (expected: range(0, 0))
>     at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:200) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>     at io.netty.buffer.DrillBuf.chk(DrillBuf.java:222) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>     at io.netty.buffer.DrillBuf.setByte(DrillBuf.java:621) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:4.0.24.Final]
>     at org.apache.drill.exec.vector.UInt1Vector$Mutator.set(UInt1Vector.java:342) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>     at org.apache.drill.exec.vector.NullableBigIntVector$Mutator.set(NullableBigIntVector.java:372) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>     at org.apache.drill.exec.vector.NullableBigIntVector.copyFrom(NullableBigIntVector.java:284) ~[drill-java-exec-0.9.0-SNAPSHOT-rebuffed.jar:0.9.0-SNAPSHOT]
>     at org.apache.drill.exec.test.generated.PartitionerGen4$OutgoingRecordBatch.doEval(PartitionerTemplate.java:370) ~[na:na]
>     at org.apache.drill.exec.test.generated.PartitionerGen4$OutgoingRecordBatch.copy(PartitionerTemplate.java:249) ~[na:na]
>     at org.apache.drill.exec.test.generated.PartitionerGen4.doCopy(PartitionerTemplate.java:208) ~[na:na]
>     at org.apache.drill.exec.test.generated.PartitionerGen4.partitionBatch(PartitionerTemplate.java:176) ~[na:na]
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)