[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523102#comment-17523102
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851626806


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/ScanSchemaResolver.java:
##
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
 switch (mode) {
   case FIRST_READER_SCHEMA:
   case READER_SCHEMA:
-if (schema.projectionType() != ProjectionType.ALL) {
+if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {

Review Comment:
   Let's look at this case :
   1. batch reader received the project column, from ` SELECT path, data_type, 
file_name FROM ` in the constructor().
   2. schema is [TupleSchema [ProjectedColumn [`path` (LATE:REQUIRED)]], 
[ProjectedColumn [`data_type` (LATE:REQUIRED)]], [ProjectedColumn [`file_name` 
(LATE:REQUIRED)]]].
   3. ProjectionType = 'SOME'.
   4. batch reader create new repeated list column in next().
   5. do the project for the schema.
   ```java
   ColumnHandle existing = schema.find(colSchema.name());
if (existing == null) {
 insertColumn(colSchema);
   }
   ```
   6. catch and throw the `IllegalStateException`.
   7. failed to reader the format data.
   
   In EVF1, creating repeated list fields dynamically is allowed and does not 
throw such exceptions.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523100#comment-17523100
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851627884


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/MutableTupleSchema.java:
##
@@ -173,7 +173,7 @@ public ColumnHandle insert(int posn, ColumnMetadata col) {
   }
 
   public ColumnHandle insert(ColumnMetadata col) {
-return insert(insertPoint++, col);
+return insert(insertPoint == -1 ? size() : insertPoint++, col);

Review Comment:
   When I fixed the following error, I received a new one : Dynamically create 
a repeated list field (and success) in next(), then this field may be assigned 
to the index value of -1.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523098#comment-17523098
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851627022


##
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/v3/TestScanOuputSchema.java:
##
@@ -327,4 +375,50 @@ public void 
testStrictProvidedSchemaWithWildcardAndSpecialCols() {
 assertFalse(scan.next());
 scanFixture.close();
   }
+
+  @Test
+  public void testProvidedSchemaWithListArray() {
+TupleMetadata providedSchema = new SchemaBuilder()
+.add("a", MinorType.INT)
+.buildSchema();
+
+BaseScanFixtureBuilder builder = new BaseScanFixtureBuilder(fixture);
+builder.addReader(negotiator -> new MockSimpleReader(negotiator, true));
+builder.builder.providedSchema(providedSchema);
+builder.setProjection(new String[] { "a", "b", "c" });
+builder.builder.nullType(Types.optional(MinorType.VARCHAR));
+ScanFixture scanFixture = builder.build();
+ScanOperatorExec scan = scanFixture.scanOp;
+
+TupleMetadata expectedSchema = new SchemaBuilder()
+.add("a", MinorType.INT)
+.add("b", MinorType.VARCHAR)
+.add("c", MinorType.VARCHAR)
+.addRepeatedList("int_list")
+  .addArray(MinorType.INT)
+  .resumeSchema()
+.addRepeatedList("long_list")

Review Comment:
   These tests are to verify the above fix. Is it better to split into a new 
test class?





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523097#comment-17523097
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851626806


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/v3/schema/ScanSchemaResolver.java:
##
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
 switch (mode) {
   case FIRST_READER_SCHEMA:
   case READER_SCHEMA:
-if (schema.projectionType() != ProjectionType.ALL) {
+if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {

Review Comment:
   Let's look at this case :
   1. batch reader received the project column, from ` SELECT path, data_type, 
file_name FROM ` in the constructor().
   2. schema is [TupleSchema [ProjectedColumn [`path` (LATE:REQUIRED)]], 
[ProjectedColumn [`data_type` (LATE:REQUIRED)]], [ProjectedColumn [`file_name` 
(LATE:REQUIRED)]]].
   3. ProjectionType = 'SOME'.
   4. batch reader create new array column in next().
   5. do the project for the schema.
   ```java
   ColumnHandle existing = schema.find(colSchema.name());
if (existing == null) {
 insertColumn(colSchema);
   }
   ```
   6. catch and throw the `IllegalStateException`.
   7. failed to reader the format data.
   
   In EVF1, creating array fields dynamically is allowed and does not throw 
such exceptions.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523093#comment-17523093
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851624724


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file

Review Comment:
   Interesting thing, when we use EVF2, the batch reader only needs a 
constructor to start all initialization operations. But we will write a lot of 
code into that function, initialize the final variables, open files, define the 
schema, and so on. So I used the code blocks to group these actions to make 
them easier to read.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523091#comment-17523091
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623857


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -410,8 +390,22 @@ public boolean next() {
 return true;
   }
 
+  @Override
+  public void close() {
+AutoCloseables.closeSilently(hdfFile, reader);
+/*
+ * The current implementation of the HDF5 reader creates a temp file which 
needs to be removed
+ * when the batch reader is closed. A possible future functionality might 
be to used to.
+ */
+boolean result = hdfFile.getFile().delete();

Review Comment:
   
https://github.com/jamesmudd/jhdf/blob/3136bba6d5c6d6956bea006865c7241b6ddc6f25/jhdf/src/main/java/io/jhdf/HdfFile.java#L167-L169
   In the above code, these temporary files are deleted only before application 
exits.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523086#comment-17523086
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623334


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -410,8 +390,22 @@ public boolean next() {
 return true;
   }
 
+  @Override
+  public void close() {
+AutoCloseables.closeSilently(hdfFile, reader);
+/*
+ * The current implementation of the HDF5 reader creates a temp file which 
needs to be removed
+ * when the batch reader is closed. A possible future functionality might 
be to used to.

Review Comment:
   Done, I added new text for the note.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523083#comment-17523083
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623187


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file
+  InputStream in = null;
+  try {

Review Comment:
   Done, thanks.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523084#comment-17523084
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623209


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file
+  InputStream in = null;
+  try {
+/*
+ * As a possible future improvement, the jhdf reader has the ability 
to read hdf5 files from
+ * a byte array or byte buffer. This implementation is better in that 
it does not require creating
+ * a temporary file which must be deleted later.  However, it could 
result in memory issues in the
+ * event of large files.
+ */
+in = 
file.fileSystem().openPossiblyCompressedStream(file.split().getPath());
+hdfFile = HdfFile.fromInputStream(in);
+reader = new BufferedReader(new InputStreamReader(in));

Review Comment:
   Good catch, Done.





> Convert HDF5 format to EVF2
> ---
>
> Key: DRILL-8188
> URL: https://issues.apache.org/jira/browse/DRILL-8188
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.20.0
>Reporter: Cong Luo
>Assignee: Cong Luo
>Priority: Major
>
> Use EVF V2 instead of old V1.
> Also, fixed a few bugs in V2 framework.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (DRILL-8188) Convert HDF5 format to EVF2

2022-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523082#comment-17523082
 ] 

ASF GitHub Bot commented on DRILL-8188:
---

luocooong commented on code in PR #2515:
URL: https://github.com/apache/drill/pull/2515#discussion_r851623160


##
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java:
##
@@ -171,107 +168,109 @@ public HDF5ReaderConfig(HDF5FormatPlugin plugin, 
HDF5FormatConfig formatConfig)
 }
   }
 
-  public HDF5BatchReader(HDF5ReaderConfig readerConfig, int maxRecords) {
-this.readerConfig = readerConfig;
-this.maxRecords = maxRecords;
+  public HDF5BatchReader(HDF5ReaderConfig config, EasySubScan scan, 
FileSchemaNegotiator negotiator) {
+errorContext = negotiator.parentErrorContext();
+file = negotiator.file();
+readerConfig = config;
 dataWriters = new ArrayList<>();
-this.showMetadataPreview = readerConfig.formatConfig.showPreview();
-  }
+showMetadataPreview = readerConfig.formatConfig.showPreview();
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {
-split = negotiator.split();
-errorContext = negotiator.parentErrorContext();
 // Since the HDF file reader uses a stream to actually read the file, the 
file name from the
 // module is incorrect.
-fileName = split.getPath().getName();
-try {
-  openFile(negotiator);
-} catch (IOException e) {
-  throw UserException
-.dataReadError(e)
-.addContext("Failed to close input file: %s", split.getPath())
-.addContext(errorContext)
-.build(logger);
-}
+fileName = file.split().getPath().getName();
 
-ResultSetLoader loader;
-if (readerConfig.defaultPath == null) {
-  // Get file metadata
-  List metadata = getFileMetadata(hdfFile, new 
ArrayList<>());
-  metadataIterator = metadata.iterator();
-
-  // Schema for Metadata query
-  SchemaBuilder builder = new SchemaBuilder()
-.addNullable(PATH_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_TYPE_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(FILE_NAME_COLUMN_NAME, MinorType.VARCHAR)
-.addNullable(DATA_SIZE_COLUMN_NAME, MinorType.BIGINT)
-.addNullable(IS_LINK_COLUMN_NAME, MinorType.BIT)
-.addNullable(ELEMENT_COUNT_NAME, MinorType.BIGINT)
-.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
-.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);
-
-  negotiator.tableSchema(builder.buildSchema(), false);
-
-  loader = negotiator.build();
-  dimensions = new int[0];
-  rowWriter = loader.writer();
-
-} else {
-  // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
-  // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-  Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
-  dimensions = dataSet.getDimensions();
-
-  loader = negotiator.build();
-  rowWriter = loader.writer();
-  writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
-  negotiator.parentErrorContext());
-  if (dimensions.length <= 1) {
-buildSchemaFor1DimensionalDataset(dataSet);
-  } else if (dimensions.length == 2) {
-buildSchemaFor2DimensionalDataset(dataSet);
-  } else {
-// Case for datasets of greater than 2D
-// These are automatically flattened
-buildSchemaFor2DimensionalDataset(dataSet);
+{ // Opens an HDF5 file
+  InputStream in = null;
+  try {
+/*
+ * As a possible future improvement, the jhdf reader has the ability 
to read hdf5 files from
+ * a byte array or byte buffer. This implementation is better in that 
it does not require creating
+ * a temporary file which must be deleted later.  However, it could 
result in memory issues in the
+ * event of large files.
+ */
+in = 
file.fileSystem().openPossiblyCompressedStream(file.split().getPath());
+hdfFile = HdfFile.fromInputStream(in);
+reader = new BufferedReader(new InputStreamReader(in));
+  } catch (IOException e) {
+throw UserException
+  .dataReadError(e)
+  .message("Failed to open input file: %s", file.split().getPath())
+  .addContext(errorContext)
+  .build(logger);
+  } finally {
+AutoCloseables.closeSilently(in);
   }
 }
-if (readerConfig.defaultPath == null) {
-  pathWriter = rowWriter.scalar(PATH_COLUMN_NAME);
-  dataTypeWriter = rowWriter.scalar(DATA_TYPE_COLUMN_NAME);
-  fileNameWriter = rowWriter.scalar(FILE_NAME_COLUMN_NAME);
-  dataSizeWriter = rowWriter.scalar(DATA_SIZE_COLUMN_NAME);
-  linkWriter = rowWriter.scalar(IS_LINK_COLUMN_NAME);