[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523102#comment-17523102 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851626806

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/ScanSchemaResolver.java:

```java
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
     switch (mode) {
     case FIRST_READER_SCHEMA:
     case READER_SCHEMA:
-      if (schema.projectionType() != ProjectionType.ALL) {
+      if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {
```

Review Comment:
Let's look at this case:
1. The batch reader receives the projected columns from `SELECT path, data_type, file_name FROM` in its constructor.
2. The schema is [TupleSchema [ProjectedColumn [`path` (LATE:REQUIRED)]], [ProjectedColumn [`data_type` (LATE:REQUIRED)]], [ProjectedColumn [`file_name` (LATE:REQUIRED)]]].
3. ProjectionType = 'SOME'.
4. The batch reader creates a new repeated list column in next().
5. The projection is applied to the schema:
```java
ColumnHandle existing = schema.find(colSchema.name());
if (existing == null) {
  insertColumn(colSchema);
}
```
6. insertColumn() throws an `IllegalStateException`.
7. The reader fails to read the format data.

In EVF1, creating repeated list fields dynamically is allowed and does not throw such exceptions.

> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.20.0
> Reporter: Cong Luo
> Assignee: Cong Luo
> Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
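The guard being discussed can be sketched in isolation. This is a minimal stand-in for the `ScanSchemaResolver.insertColumn()` behavior, not Drill's actual classes: the `Col` record and the static fields here are hypothetical, and only the patched condition (`projectionType != ALL && !col.isArray()`) is taken from the diff above.

```java
import java.util.ArrayList;
import java.util.List;

public class InsertGuardSketch {
  enum ProjectionType { ALL, SOME }

  // Hypothetical stand-in for ColumnMetadata: just a name and an array flag.
  record Col(String name, boolean isArray) {}

  static ProjectionType projectionType = ProjectionType.SOME;
  static List<Col> schema = new ArrayList<>();

  // Mirrors the patched check: under a strict ("SOME") projection, a
  // reader-created column is rejected unless it is an array, which readers
  // such as HDF5 may create dynamically in next().
  static void insertColumn(Col col) {
    if (projectionType != ProjectionType.ALL && !col.isArray()) {
      throw new IllegalStateException("Column " + col.name() + " not projected");
    }
    schema.add(col);
  }

  public static void main(String[] args) {
    insertColumn(new Col("int_list", true));   // allowed after the fix
    try {
      insertColumn(new Col("extra", false));   // still rejected
    } catch (IllegalStateException e) {
      System.out.println("rejected: extra");
    }
    System.out.println("schema size: " + schema.size());
  }
}
```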
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523100#comment-17523100 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851627884

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/MutableTupleSchema.java:

```java
@@ -173,7 +173,7 @@ public ColumnHandle insert(int posn, ColumnMetadata col) {
   }

   public ColumnHandle insert(ColumnMetadata col) {
-    return insert(insertPoint++, col);
+    return insert(insertPoint == -1 ? size() : insertPoint++, col);
```

Review Comment:
When I fixed the previous error, I ran into a new one: a repeated list field created dynamically (and successfully) in next() may then be assigned an insert index of -1.
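The index problem and the fallback in the patch can be illustrated with a minimal stand-in for `MutableTupleSchema` (the class and field names below are hypothetical, not Drill's actual implementation): when the insert point is unset (-1), the column is appended at the end instead of being inserted at index -1, which would throw an `IndexOutOfBoundsException`.

```java
import java.util.ArrayList;
import java.util.List;

public class InsertPointSketch {
  // Hypothetical stand-in for MutableTupleSchema: a column list plus an
  // insert point. -1 means "no fixed insert point", e.g. a column created
  // dynamically by the reader in next().
  static final List<String> columns = new ArrayList<>();
  static int insertPoint = -1;

  static void insert(int posn, String col) {
    columns.add(posn, col); // posn == -1 would throw IndexOutOfBoundsException
  }

  // The patched overload: fall back to appending at the end when the
  // insert point is unset.
  static void insert(String col) {
    insert(insertPoint == -1 ? columns.size() : insertPoint++, col);
  }

  public static void main(String[] args) {
    insert("path");
    insert("data_type");
    insert("int_list"); // dynamically created array column, appended last
    System.out.println(columns);
  }
}
```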
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523098#comment-17523098 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851627022

## exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/v3/TestScanOuputSchema.java:

```java
@@ -327,4 +375,50 @@ public void testStrictProvidedSchemaWithWildcardAndSpecialCols() {
     assertFalse(scan.next());
     scanFixture.close();
   }

+  @Test
+  public void testProvidedSchemaWithListArray() {
+    TupleMetadata providedSchema = new SchemaBuilder()
+        .add("a", MinorType.INT)
+        .buildSchema();
+
+    BaseScanFixtureBuilder builder = new BaseScanFixtureBuilder(fixture);
+    builder.addReader(negotiator -> new MockSimpleReader(negotiator, true));
+    builder.builder.providedSchema(providedSchema);
+    builder.setProjection(new String[] { "a", "b", "c" });
+    builder.builder.nullType(Types.optional(MinorType.VARCHAR));
+    ScanFixture scanFixture = builder.build();
+    ScanOperatorExec scan = scanFixture.scanOp;
+
+    TupleMetadata expectedSchema = new SchemaBuilder()
+        .add("a", MinorType.INT)
+        .add("b", MinorType.VARCHAR)
+        .add("c", MinorType.VARCHAR)
+        .addRepeatedList("int_list")
+          .addArray(MinorType.INT)
+          .resumeSchema()
+        .addRepeatedList("long_list")
```

Review Comment:
These tests verify the fix above. Would it be better to split them into a new test class?
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523097#comment-17523097 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851626806

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/ScanSchemaResolver.java:

```java
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
     switch (mode) {
     case FIRST_READER_SCHEMA:
     case READER_SCHEMA:
-      if (schema.projectionType() != ProjectionType.ALL) {
+      if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {
```

Review Comment:
Let's look at this case:
1. The batch reader receives the projected columns from `SELECT path, data_type, file_name FROM` in its constructor.
2. The schema is [TupleSchema [ProjectedColumn [`path` (LATE:REQUIRED)]], [ProjectedColumn [`data_type` (LATE:REQUIRED)]], [ProjectedColumn [`file_name` (LATE:REQUIRED)]]].
3. ProjectionType = 'SOME'.
4. The batch reader creates a new array column in next().
5. The projection is applied to the schema:
```java
ColumnHandle existing = schema.find(colSchema.name());
if (existing == null) {
  insertColumn(colSchema);
}
```
6. insertColumn() throws an `IllegalStateException`.
7. The reader fails to read the format data.

In EVF1, creating array fields dynamically is allowed and does not throw such exceptions.
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523093#comment-17523093 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851624724

## contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:

```java
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, HDF5FormatConfig formatConfig)
     }
   }

-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-    this.readerConfig = readerConfig;
-    this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, FileSchemaNegotiator negotiator) {
+    errorContext = negotiator.parentErrorContext();
+    file = negotiator.file();
+    readerConfig = config;
     dataWriters = new ArrayList<>();
-    this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+    showMetadataPreview = readerConfig.formatConfig.showPreview();

-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-    split = negotiator.split();
-    errorContext = negotiator.parentErrorContext();
     // Since the HDF file reader uses a stream to actually read the file, the file name from the
     // module is incorrect.
-    fileName = split.getPath().getName();
-    try {
-      openFile(negotiator);
-    } catch (IOException e) {
-      throw UserException
-        .dataReadError(e)
-        .addContext("Failed to close input file: %s", split.getPath())
-        .addContext(errorContext)
-        .build(logger);
-    }
+    fileName = file.split().getPath().getName();

-    ResultSetLoader loader;
-    if (readerConfig.defaultPath == null) {
-      // Get file metadata
-      List metadata = getFileMetadata(hdfFile, new ArrayList<>());
-      metadataIterator = metadata.iterator();
-
-      // Schema for Metadata query
-      SchemaBuilder builder = new SchemaBuilder()
-        .addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-        .addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-        .addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-        .addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-        .addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-        .addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-        .addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-        .addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-      negotiator.tableSchema(builder.buildSchema(), false);
-
-      loader = negotiator.build();
-      dimensions = new int[0];
-      rowWriter = loader.writer();
-
-    } else {
-      // This is the case when the default path is specified. Since the user is explicitly asking for a dataset
-      // Drill can obtain the schema by getting the datatypes below and ultimately mapping that schema to columns
-      Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-      dimensions = dataSet.getDimensions();
-
-      loader = negotiator.build();
-      rowWriter = loader.writer();
-      writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-          negotiator.parentErrorContext());
-      if (dimensions.length <= 1) {
-        buildSchemaFor1DimensionalDataset(dataSet);
-      } else if (dimensions.length == 2) {
-        buildSchemaFor2DimensionalDataset(dataSet);
-      } else {
-        // Case for datasets of greater than 2D
-        // These are automatically flattened
-        buildSchemaFor2DimensionalDataset(dataSet);
+    { // Opens an HDF5 file
```

Review Comment:
Interestingly, with EVF2 the batch reader only needs a constructor to perform all of its initialization. But that puts a lot of code into one method: initializing the final variables, opening files, defining the schema, and so on. So I used bare code blocks to group these actions and make them easier to read.
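The grouping idiom described in the comment above can be shown with a minimal, hypothetical reader (this is not the actual HDF5BatchReader; the class, fields, and phases are invented for illustration). Inside a constructor, bare `{ ... }` blocks are ordinary statement blocks, so they can scope locals and visually separate initialization phases while still assigning final fields.

```java
public class GroupedConstructorSketch {
  private final String fileName;
  private final int columnCount;

  // EVF2-style readers do all setup in the constructor; bare code blocks
  // group the phases so each can be read in isolation.
  GroupedConstructorSketch(String path) {
    { // Phase 1: resolve the file name from the path
      int slash = path.lastIndexOf('/');
      fileName = slash < 0 ? path : path.substring(slash + 1);
    }
    { // Phase 2: define a (hypothetical) metadata schema
      columnCount = 3; // e.g. path, data_type, file_name
    }
  }

  public static void main(String[] args) {
    GroupedConstructorSketch r = new GroupedConstructorSketch("/data/test.h5");
    System.out.println(r.fileName + ", " + r.columnCount);
  }
}
```

A design note: the locals declared inside each block (such as `slash`) go out of scope when the block ends, which keeps the later phases from depending on them accidentally.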
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523091#comment-17523091 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623857

## contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:

```java
@@ -410,8 +390,22 @@ public boolean next() {
     return true;
   }

+  @Override
+  public void close() {
+    AutoCloseables.closeSilently(hdfFile, reader);
+    /*
+     * The current implementation of the HDF5 reader creates a temp file which needs to be removed
+     * when the batch reader is closed. A possible future functionality might be to used to.
+     */
+    boolean result = hdfFile.getFile().delete();
```

Review Comment:
https://github.com/jamesmudd/jhdf/blob/3136bba6d5c6d6956bea006865c7241b6ddc6f25/jhdf/src/main/java/io/jhdf/HdfFile.java#L167-L169
In the code above, these temporary files are deleted only before the application exits.
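The cleanup pattern under discussion, close resources silently and then delete the reader's temp file eagerly rather than waiting for JVM exit, can be sketched without the Drill or jhdf APIs. Everything here is a hypothetical stand-in: `closeSilently` only mimics the spirit of Drill's `AutoCloseables.closeSilently()`, and the temp file merely represents the one jhdf creates when reading from a stream.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CloseAndDeleteSketch {
  // Close each resource, swallowing exceptions, in the spirit of Drill's
  // AutoCloseables.closeSilently().
  static void closeSilently(Closeable... resources) {
    for (Closeable c : resources) {
      if (c == null) continue;
      try { c.close(); } catch (IOException ignored) { }
    }
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the temp file the HDF5 reader leaves behind.
    File temp = Files.createTempFile("hdf5-sketch", ".tmp").toFile();
    Closeable resource = () -> { }; // stand-in for hdfFile / reader

    closeSilently(resource);
    // Delete eagerly in close() instead of waiting for application exit.
    boolean deleted = temp.delete();
    System.out.println("deleted: " + deleted);
  }
}
```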
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523086#comment-17523086 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623334

## contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:

```java
+    /*
+     * The current implementation of the HDF5 reader creates a temp file which needs to be removed
+     * when the batch reader is closed. A possible future functionality might be to used to.
+     */
```

Review Comment:
Done, I added new text for the note.
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523083#comment-17523083 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623187

## contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:

```java
+    { // Opens an HDF5 file
+      InputStream in = null;
+      try {
```

Review Comment:
Done, thanks.
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523084#comment-17523084 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623209

## contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:

```java
+    { // Opens an HDF5 file
+      InputStream in = null;
+      try {
+        /*
+         * As a possible future improvement, the jhdf reader has the ability to read hdf5 files from
+         * a byte array or byte buffer. This implementation is better in that it does not require creating
+         * a temporary file which must be deleted later. However, it could result in memory issues in the
+         * event of large files.
+         */
+        in = file.fileSystem().openPossiblyCompressedStream(file.split().getPath());
+        hdfFile = HdfFile.fromInputStream(in);
+        reader = new BufferedReader(new InputStreamReader(in));
```

Review Comment:
Good catch, Done.
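The hazard flagged in this thread is a single stream handed to two consumers and then closed. One generic way to avoid it, sketched here without the Drill or jhdf APIs and with entirely hypothetical names, is to copy the input to a temp file up front, close the stream immediately, and let each consumer open the file independently afterwards. This mirrors the temp-file behavior the PR discussion describes, but is not jhdf's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamToTempSketch {
  // Copy the (possibly compressed) input to a temp file, then close the
  // stream right away; consumers open the file independently afterwards.
  static Path materialize(InputStream in) throws IOException {
    Path temp = Files.createTempFile("hdf5-sketch", ".h5");
    try (InputStream source = in) {
      Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
    }
    return temp;
  }

  public static void main(String[] args) throws IOException {
    byte[] payload = "not a real HDF5 file".getBytes();
    Path temp = materialize(new ByteArrayInputStream(payload));
    System.out.println("bytes: " + Files.size(temp));
    Files.delete(temp); // the reader's close() must remove the temp file
  }
}
```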
[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2
[ https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523082#comment-17523082 ]

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623160

## contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:

```java
+    { // Opens an HDF5 file
+      InputStream in = null;
+      try {
+        /*
+         * As a possible future improvement, the jhdf reader has the ability to read hdf5 files from
+         * a byte array or byte buffer. This implementation is better in that it does not require creating
+         * a temporary file which must be deleted later. However, it could result in memory issues in the
+         * event of large files.
+         */
+        in = file.fileSystem().openPossiblyCompressedStream(file.split().getPath());
+        hdfFile = HdfFile.fromInputStream(in);
+        reader = new BufferedReader(new InputStreamReader(in));
+      } catch (IOException e) {
+        throw UserException
+          .dataReadError(e)
+          .message("Failed to open input file: %s", file.split().getPath())
+          .addContext(errorContext)
+          .build(logger);
+      } finally {
+        AutoCloseables.closeSilently(in);
+      }
+    }

-    if (readerConfig.defaultPath == null) {
-      pathWriter = rowWriter.scalar(PATH_COLUMN_NAME);
-      dataTypeWriter = rowWriter.scalar(DATA_TYPE_COLUMN_NAME);
-      fileNameWriter = rowWriter.scalar(FILE_NAME_COLUMN_NAME);
-      dataSizeWriter = rowWriter.scalar(DATA_SIZE_COLUMN_NAME);
-      linkWriter = rowWriter.scalar(IS_LINK_COLUMN_NAME);
```