[jira] [Commented] (DRILL-8390) Minor Improvements to PDF Reader
[ https://issues.apache.org/jira/browse/DRILL-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678473#comment-17678473 ]

ASF GitHub Bot commented on DRILL-8390:
---

cgivre opened a new pull request, #2742:
URL: https://github.com/apache/drill/pull/2742

# [DRILL-8390](https://issues.apache.org/jira/browse/DRILL-8390): Minor Improvements to PDF Reader

## Description
This PR makes some minor improvements to the PDF reader, including:
* Fixes a minor bug where, in certain configurations, the first row of data was skipped
* Fixes a minor bug where empty tables caused crashes when the spreadsheet extraction algorithm was used
* Adds a `_table_count` metadata field
* Adds a `_table_index` metadata field to reflect the current table.

## Documentation
See above. Updated README.

## Testing
Ran existing unit tests. Manually tested against customer data.

> Minor Improvements to PDF Reader
> ---
>
> Key: DRILL-8390
> URL: https://issues.apache.org/jira/browse/DRILL-8390
> Project: Apache Drill
> Issue Type: Improvement
> Components: Format - PDF
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
>
> This PR makes some minor improvements to the PDF reader, including:
> * Fixes a minor bug where, in certain configurations, the first row of data
> was skipped
> * Fixes a minor bug where empty tables caused crashes when the spreadsheet
> extraction algorithm was used
> * Adds a table_count metadata field
> * Adds a table_index metadata field to reflect the current table.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8390) Minor Improvements to PDF Reader
Charles Givre created DRILL-8390:

Summary: Minor Improvements to PDF Reader
Key: DRILL-8390
URL: https://issues.apache.org/jira/browse/DRILL-8390
Project: Apache Drill
Issue Type: Improvement
Components: Format - PDF
Reporter: Charles Givre
Assignee: Charles Givre

This PR makes some minor improvements to the PDF reader, including:
* Fixes a minor bug where, in certain configurations, the first row of data was skipped
* Fixes a minor bug where empty tables caused crashes when the spreadsheet extraction algorithm was used
* Adds a table_count metadata field
* Adds a table_index metadata field to reflect the current table.
[jira] [Updated] (DRILL-8388) Undesired query cancellation results in zero-byte Parquet files
[ https://issues.apache.org/jira/browse/DRILL-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Turton updated DRILL-8388:

Summary: Undesired query cancellation results in zero-byte Parquet files (was: Zero-record Parquet writer fragments result in query cancellation and zero-byte Parquet files)

> Undesired query cancellation results in zero-byte Parquet files
> ---
>
> Key: DRILL-8388
> URL: https://issues.apache.org/jira/browse/DRILL-8388
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Writer
> Affects Versions: 1.20.3
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.21.0
>
>
> I'll refine this ticket as I discover more, but at the current time I believe
> this bug can be reproduced as follows.
> # The Drill writer format is set to Parquet.
> # A CTAS statement is issued over JDBC (the bug does not appear to manifest
> for the same query received over REST).
> # The CTAS statement spawns multiple Parquet writer fragments.
> # The query is apparently cancelled (by the Drill/JDBC client?) before all
> of the writer fragments have completed.
> # Some writer fragments have created no output file at all. Others have
> created invalid, zero-byte Parquet files. Others have created valid empty
> Parquet files and others have created valid non-empty Parquet files.
> # A subsequent query against the destination fails because it encounters
> zero-byte Parquet files.
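The file states described in the reproduction steps (zero-byte files, invalid files, and files that at least look like Parquet) can be told apart with a quick scan of the CTAS destination directory. The following is a minimal Python sketch, not part of Drill: the function names and the directory argument are hypothetical, and it assumes the destination is reachable as a local path. It relies only on the fact that a plain Parquet file begins and ends with the 4-byte magic `PAR1`, so a file reported as "ok" has passed a magic-byte check only and may still have a corrupt footer.

```python
from pathlib import Path
from typing import Dict, List

PARQUET_MAGIC = b"PAR1"
# Smallest well-formed layout: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def classify_parquet_file(path: Path) -> str:
    """Return 'zero-byte', 'invalid', or 'ok' (magic bytes present) for one file."""
    size = path.stat().st_size
    if size == 0:
        return "zero-byte"
    if size < MIN_PARQUET_SIZE:
        return "invalid"
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # position 4 bytes before the end of the file
        tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "ok"
    return "invalid"


def scan_table_dir(table_dir: str) -> Dict[str, List[Path]]:
    """Group every *.parquet file under table_dir by its classification."""
    report: Dict[str, List[Path]] = {"zero-byte": [], "invalid": [], "ok": []}
    for path in sorted(Path(table_dir).rglob("*.parquet")):
        report[classify_parquet_file(path)].append(path)
    return report
```

Calling `scan_table_dir` on the directory written by the CTAS statement and inspecting the `"zero-byte"` list would show which writer fragments left behind the files that break the subsequent query. Writer fragments that created no file at all cannot be detected this way.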
[jira] [Updated] (DRILL-8388) Zero-record Parquet writer fragments result in query cancellation and zero-byte Parquet files
[ https://issues.apache.org/jira/browse/DRILL-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Turton updated DRILL-8388:

Description:
I'll refine this ticket as I discover more, but at the current time I believe this bug can be reproduced as follows.
# The Drill writer format is set to Parquet.
# A CTAS statement is issued over JDBC (the bug does not appear to manifest for the same query received over REST).
# The CTAS statement spawns multiple Parquet writer fragments.
# The query is apparently cancelled (by the Drill/JDBC client?) before all of the writer fragments have completed.
# Some writer fragments have created no output file at all. Others have created invalid, zero-byte Parquet files. Others have created valid empty Parquet files and others have created valid non-empty Parquet files.
# A subsequent query against the destination fails because it encounters zero-byte Parquet files.

was:
I'll refine this ticket as I discover more, but at the current time I believe this bug can be reproduced as follows.
# The Drill writer format is set to Parquet.
# A CTAS statement is issued over JDBC (the bug does not appear to manifest for the same query received over REST).
# The CTAS statement spawns multiple Parquet writer fragments. It may also be necessary that these fragments are distributed over more than one Drillbit (unconfirmed on a single Drillbit).
# The query is apparently cancelled (by the Drill/JDBC client?) before all of the writer fragments have completed.
# Some writer fragments have created no output file at all. Others have created invalid, zero-byte Parquet files. Others have created valid empty Parquet files and others have created valid non-empty Parquet files.
# A subsequent query against the destination fails because it encounters zero-byte Parquet files.

> Zero-record Parquet writer fragments result in query cancellation and
> zero-byte Parquet files
> ---
>
> Key: DRILL-8388
> URL: https://issues.apache.org/jira/browse/DRILL-8388
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Writer
> Affects Versions: 1.20.3
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.21.0
>
>
> I'll refine this ticket as I discover more, but at the current time I believe
> this bug can be reproduced as follows.
> # The Drill writer format is set to Parquet.
> # A CTAS statement is issued over JDBC (the bug does not appear to manifest
> for the same query received over REST).
> # The CTAS statement spawns multiple Parquet writer fragments.
> # The query is apparently cancelled (by the Drill/JDBC client?) before all
> of the writer fragments have completed.
> # Some writer fragments have created no output file at all. Others have
> created invalid, zero-byte Parquet files. Others have created valid empty
> Parquet files and others have created valid non-empty Parquet files.
> # A subsequent query against the destination fails because it encounters
> zero-byte Parquet files.
[jira] [Updated] (DRILL-8388) Zero-record Parquet writer fragments result in query cancellation and zero-byte Parquet files
[ https://issues.apache.org/jira/browse/DRILL-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Turton updated DRILL-8388:

Description:
I'll refine this ticket as I discover more, but at the current time I believe this bug can be reproduced as follows.
# The Drill writer format is set to Parquet.
# A CTAS statement is issued over JDBC (the bug does not appear to manifest for the same query received over REST).
# The CTAS statement spawns multiple Parquet writer fragments. It may also be necessary that these fragments are distributed over more than one Drillbit (unconfirmed on a single Drillbit).
# The query is apparently cancelled (by the Drill/JDBC client?) before all of the writer fragments have completed.
# Some writer fragments have created no output file at all. Others have created invalid, zero-byte Parquet files. Others have created valid empty Parquet files and others have created valid non-empty Parquet files.
# A subsequent query against the destination fails because it encounters zero-byte Parquet files.

was:
I'll refine this ticket as I discover more, but at the current time I believe this bug can be reproduced as follows.
# The Drill writer format is set to Parquet.
# A CTAS statement is issued over JDBC (the bug does not appear to manifest for the same query received over REST).
# The CTAS statement spawns multiple Parquet writer fragments. It may also be necessary that these fragments are distributed over more than one Drillbit (unconfirmed on a single Drillbit).
# Some of the Parquet writer fragments receive batches containing zero records.
# The query is apparently cancelled (by the Drill/JDBC client?) before all of the writer fragments have completed.
# Some writer fragments have created no output file at all. Others have created invalid, zero-byte Parquet files. Others have created valid empty Parquet files and others have created valid non-empty Parquet files.
# A subsequent query against the destination fails because it encounters zero-byte Parquet files.

> Zero-record Parquet writer fragments result in query cancellation and
> zero-byte Parquet files
> ---
>
> Key: DRILL-8388
> URL: https://issues.apache.org/jira/browse/DRILL-8388
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Writer
> Affects Versions: 1.20.3
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.21.0
>
>
> I'll refine this ticket as I discover more, but at the current time I believe
> this bug can be reproduced as follows.
> # The Drill writer format is set to Parquet.
> # A CTAS statement is issued over JDBC (the bug does not appear to manifest
> for the same query received over REST).
> # The CTAS statement spawns multiple Parquet writer fragments. It may also
> be necessary that these fragments are distributed over more than one Drillbit
> (unconfirmed on a single Drillbit).
> # The query is apparently cancelled (by the Drill/JDBC client?) before all
> of the writer fragments have completed.
> # Some writer fragments have created no output file at all. Others have
> created invalid, zero-byte Parquet files. Others have created valid empty
> Parquet files and others have created valid non-empty Parquet files.
> # A subsequent query against the destination fails because it encounters
> zero-byte Parquet files.