[jira] [Commented] (DRILL-8390) Minor Improvements to PDF Reader

2023-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678473#comment-17678473
 ] 

ASF GitHub Bot commented on DRILL-8390:
---

cgivre opened a new pull request, #2742:
URL: https://github.com/apache/drill/pull/2742

   # [DRILL-8390](https://issues.apache.org/jira/browse/DRILL-8390): Minor 
Improvements to PDF Reader
   
   
   ## Description
   This PR makes some minor improvements to the PDF reader including:
   * Fixes a minor bug where, in certain configurations, the first row of data 
was skipped
   * Fixes a minor bug where empty tables caused crashes when the 
spreadsheet extraction algorithm was used
   * Adds a `_table_count` metadata field
   * Adds a `_table_index` metadata field to reflect the current table.
   
   ## Documentation
   See above.  Updated README.
   
   ## Testing
   Ran existing unit tests.  Manually tested against customer data.




> Minor Improvements to PDF Reader
> 
>
> Key: DRILL-8390
> URL: https://issues.apache.org/jira/browse/DRILL-8390
> Project: Apache Drill
> Issue Type: Improvement
> Components: Format - PDF
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
>
> This PR makes some minor improvements to the PDF reader including:
>  * Fixes a minor bug where, in certain configurations, the first row of data 
> was skipped
>  * Fixes a minor bug where empty tables caused crashes when the 
> spreadsheet extraction algorithm was used
>  * Adds a table_count metadata field
>  * Adds a table_index metadata field to reflect the current table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8390) Minor Improvements to PDF Reader

2023-01-18 Thread Charles Givre (Jira)
Charles Givre created DRILL-8390:


Summary: Minor Improvements to PDF Reader
Key: DRILL-8390
URL: https://issues.apache.org/jira/browse/DRILL-8390
Project: Apache Drill
Issue Type: Improvement
Components: Format - PDF
Reporter: Charles Givre
Assignee: Charles Givre


This PR makes some minor improvements to the PDF reader including:
 * Fixes a minor bug where, in certain configurations, the first row of data 
was skipped
 * Fixes a minor bug where empty tables caused crashes when the 
spreadsheet extraction algorithm was used
 * Adds a table_count metadata field
 * Adds a table_index metadata field to reflect the current table.





[jira] [Updated] (DRILL-8388) Undesired query cancellation results in zero-byte Parquet files

2023-01-18 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton updated DRILL-8388:

Summary: Undesired query cancellation results in zero-byte Parquet files  
(was: Zero-record Parquet writer fragments result in query cancellation and 
zero-byte Parquet files)

> Undesired query cancellation results in zero-byte Parquet files
> ---
>
> Key: DRILL-8388
> URL: https://issues.apache.org/jira/browse/DRILL-8388
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Writer
> Affects Versions: 1.20.3
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.21.0
>
>
> I'll refine this ticket as I discover more but at the current time I believe 
> this bug can be reproduced as follows.
>  # The Drill writer format is set to Parquet.
>  # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
> for the same query received over REST).
>  # The CTAS statement spawns multiple Parquet writer fragments.
>  # The query is apparently cancelled (by the Drill/JDBC client?) before all 
> of the writer fragments have completed.
>  # Some writer fragments have created no output file at all. Others have 
> created invalid, zero-byte Parquet files. Others have created valid empty 
> Parquet files and others have created valid non-empty Parquet files.
>  # A subsequent query against the destination fails because it encounters 
> zero-byte Parquet files.
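
As a possible aid when checking a CTAS destination for the kinds of files described in the steps above, here is a minimal Python sketch, not part of Drill, that flags zero-byte files and files lacking the `PAR1` magic bytes that an unencrypted Parquet file carries at its start and end. The function names and the `*.parquet` glob are assumptions made for illustration. The magic-byte check is only a heuristic and does not validate the footer metadata.

```python
import os
from pathlib import Path

# 4-byte marker found at both the start and the end of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def classify_parquet_file(path):
    """Return 'zero-byte', 'bad-magic', or 'magic-ok' for a single file."""
    size = os.path.getsize(path)
    if size == 0:
        return "zero-byte"
    # Leading magic + 4-byte footer length + trailing magic = 12 bytes minimum.
    if size < 12:
        return "bad-magic"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "magic-ok"
    return "bad-magic"


def scan_directory(directory):
    """Map each *.parquet file name in `directory` to its classification."""
    files = sorted(Path(directory).glob("*.parquet"))
    return {p.name: classify_parquet_file(p) for p in files}
```

Running `scan_directory` on the CTAS output directory (assuming it is reachable as a local path) would list which writer fragments left zero-byte files behind, so they can be inspected or removed before the destination is queried again.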





[jira] [Updated] (DRILL-8388) Zero-record Parquet writer fragments result in query cancellation and zero-byte Parquet files

2023-01-18 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton updated DRILL-8388:

Description: 
I'll refine this ticket as I discover more but at the current time I believe 
this bug can be reproduced as follows.
 # The Drill writer format is set to Parquet.
 # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
for the same query received over REST).
 # The CTAS statement spawns multiple Parquet writer fragments.
 # The query is apparently cancelled (by the Drill/JDBC client?) before all of 
the writer fragments have completed.
 # Some writer fragments have created no output file at all. Others have 
created invalid, zero-byte Parquet files. Others have created valid empty 
Parquet files and others have created valid non-empty Parquet files.
 # A subsequent query against the destination fails because it encounters 
zero-byte Parquet files.

  was:
I'll refine this ticket as I discover more but at the current time I believe 
this bug can be reproduced as follows.
 # The Drill writer format is set to Parquet.
 # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
for the same query received over REST).
 # The CTAS statement spawns multiple Parquet writer fragments. It may also be 
necessary that these fragments are distributed over more than one Drillbit 
(unconfirmed on a single Drillbit).
 # The query is apparently cancelled (by the Drill/JDBC client?) before all of 
the writer fragments have completed.
 # Some writer fragments have created no output file at all. Others have 
created invalid, zero-byte Parquet files. Others have created valid empty 
Parquet files and others have created valid non-empty Parquet files.
 # A subsequent query against the destination fails because it encounters 
zero-byte Parquet files.


> Zero-record Parquet writer fragments result in query cancellation and 
> zero-byte Parquet files
> -
>
> Key: DRILL-8388
> URL: https://issues.apache.org/jira/browse/DRILL-8388
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Writer
> Affects Versions: 1.20.3
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.21.0
>
>
> I'll refine this ticket as I discover more but at the current time I believe 
> this bug can be reproduced as follows.
>  # The Drill writer format is set to Parquet.
>  # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
> for the same query received over REST).
>  # The CTAS statement spawns multiple Parquet writer fragments.
>  # The query is apparently cancelled (by the Drill/JDBC client?) before all 
> of the writer fragments have completed.
>  # Some writer fragments have created no output file at all. Others have 
> created invalid, zero-byte Parquet files. Others have created valid empty 
> Parquet files and others have created valid non-empty Parquet files.
>  # A subsequent query against the destination fails because it encounters 
> zero-byte Parquet files.





[jira] [Updated] (DRILL-8388) Zero-record Parquet writer fragments result in query cancellation and zero-byte Parquet files

2023-01-18 Thread James Turton (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton updated DRILL-8388:

Description: 
I'll refine this ticket as I discover more but at the current time I believe 
this bug can be reproduced as follows.
 # The Drill writer format is set to Parquet.
 # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
for the same query received over REST).
 # The CTAS statement spawns multiple Parquet writer fragments. It may also be 
necessary that these fragments are distributed over more than one Drillbit 
(unconfirmed on a single Drillbit).
 # The query is apparently cancelled (by the Drill/JDBC client?) before all of 
the writer fragments have completed.
 # Some writer fragments have created no output file at all. Others have 
created invalid, zero-byte Parquet files. Others have created valid empty 
Parquet files and others have created valid non-empty Parquet files.
 # A subsequent query against the destination fails because it encounters 
zero-byte Parquet files.

  was:
I'll refine this ticket as I discover more but at the current time I believe 
this bug can be reproduced as follows.
 # The Drill writer format is set to Parquet.
 # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
for the same query received over REST).
 # The CTAS statement spawns multiple Parquet writer fragments. It may also be 
necessary that these fragments are distributed over more than one Drillbit 
(unconfirmed on a single Drillbit).
 # Some of the Parquet writer fragments receive batches containing zero records.
 # The query is apparently cancelled (by the Drill/JDBC client?) before all of 
the writer fragments have completed.
 # Some writer fragments have created no output file at all. Others have 
created invalid, zero-byte Parquet files. Others have created valid empty 
Parquet files and others have created valid non-empty Parquet files.
 # A subsequent query against the destination fails because it encounters 
zero-byte Parquet files.


> Zero-record Parquet writer fragments result in query cancellation and 
> zero-byte Parquet files
> -
>
> Key: DRILL-8388
> URL: https://issues.apache.org/jira/browse/DRILL-8388
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Writer
> Affects Versions: 1.20.3
> Reporter: James Turton
> Assignee: James Turton
> Priority: Major
> Fix For: 1.21.0
>
>
> I'll refine this ticket as I discover more but at the current time I believe 
> this bug can be reproduced as follows.
>  # The Drill writer format is set to Parquet.
>  # A CTAS statement is issued over JDBC (the bug does not appear to manifest 
> for the same query received over REST).
>  # The CTAS statement spawns multiple Parquet writer fragments. It may also 
> be necessary that these fragments are distributed over more than one Drillbit 
> (unconfirmed on a single Drillbit).
>  # The query is apparently cancelled (by the Drill/JDBC client?) before all 
> of the writer fragments have completed.
>  # Some writer fragments have created no output file at all. Others have 
> created invalid, zero-byte Parquet files. Others have created valid empty 
> Parquet files and others have created valid non-empty Parquet files.
>  # A subsequent query against the destination fails because it encounters 
> zero-byte Parquet files.


