jamesmudd commented on a change in pull request #2164:
URL: https://github.com/apache/drill/pull/2164#discussion_r582255769



##########
File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
##########
@@ -198,27 +203,27 @@ public boolean open(FileSchemaNegotiator negotiator) {
       negotiator.tableSchema(builder.buildSchema(), false);
 
       loader = negotiator.build();
-      dimensions = new long[0];
+      dimensions = new int[0];
       rowWriter = loader.writer();
 
     } else {
       // This is the case when the default path is specified. Since the user 
is explicitly asking for a dataset
       // Drill can obtain the schema by getting the datatypes below and 
ultimately mapping that schema to columns
-      HDF5DataSetInformation dsInfo = 
hdf5Reader.object().getDataSetInformation(readerConfig.defaultPath);
-      dimensions = dsInfo.getDimensions();
+      Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
+      dimensions = dataSet.getDimensions();
 
       loader = negotiator.build();
       rowWriter = loader.writer();
       writerSpec = new WriterSpec(rowWriter, negotiator.providedSchema(),
           negotiator.parentErrorContext());
       if (dimensions.length <= 1) {

Review comment:
       Have you tested with scalar datasets?

##########
File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
##########
@@ -237,37 +242,37 @@ public boolean open(FileSchemaNegotiator negotiator) {
    * This function is called when the default path is set and the data set is 
a single dimension.
    * This function will create an array of one dataWriter of the
    * correct datatype
-   * @param dsInfo The HDF5 dataset information
+   * @param dataset The HDF5 dataset
    */
-  private void buildSchemaFor1DimensionalDataset(HDF5DataSetInformation 
dsInfo) {
-    TypeProtos.MinorType currentDataType = HDF5Utils.getDataType(dsInfo);
+  private void buildSchemaFor1DimensionalDataset(Dataset dataset) {
+    MinorType currentDataType = HDF5Utils.getDataType(dataset.getDataType());
 
     // Case for null or unknown data types:
     if (currentDataType == null) {
-      logger.warn("Couldn't add {}", 
dsInfo.getTypeInformation().tryGetJavaType().toGenericString());
+      logger.warn("Couldn't add {}", dataset.getJavaType().getName());
       return;
     }
     dataWriters.add(buildWriter(currentDataType));
   }
 
-  private HDF5DataWriter buildWriter(TypeProtos.MinorType dataType) {
+  private HDF5DataWriter buildWriter(MinorType dataType) {
     switch (dataType) {
-      case GENERIC_OBJECT:
-        return new HDF5EnumDataWriter(hdf5Reader, writerSpec, 
readerConfig.defaultPath);
+      /*case GENERIC_OBJECT:
+        return new HDF5EnumDataWriter(hdfFile, writerSpec, 
readerConfig.defaultPath);*/

Review comment:
       Possibly for the HDF5 opaque type? That is not currently supported by 
jhdf, but will be added soon.

##########
File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
##########
@@ -323,53 +328,20 @@ private void openFile(FileSchemaNegotiator negotiator) 
throws IOException {
     InputStream in = null;
     try {
       in = 
negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
-      IHDF5Factory factory = HDF5FactoryProvider.get();
-      inFile = convertInputStreamToFile(in);
-      hdf5Reader = factory.openForReading(inFile);
+      hdfFile = HdfFile.fromInputStream(in);

Review comment:
       Just need to be a bit careful here. Currently, the implementation of this 
reads the stream into a temp file, and the temp file is cleaned up when the JVM 
exits. Depending on how your app will run, that could eventually become a 
problem if you load enough files to run out of disk space. I want to change 
this so the temp file will be cleaned up when the `HdfFile` is closed, or when 
the JVM exits. If your application is better suited to holding files in memory, 
take a look at this issue https://github.com/jamesmudd/jhdf/issues/245 which 
allows in-memory files.

##########
File path: 
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5Utils.java
##########
@@ -265,4 +185,126 @@ public static String getNameFromPath(String path) {
       return "";
     }
   }
+
+  public static Object[] toMatrix(Object[] inputArray) {
+    return flatten(inputArray).toArray();
+  }
+
+  public static boolean[][] toBooleanMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((boolean[][])input[0]).length;
+
+    boolean[][] result = new boolean[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      boolean[] row = (boolean[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+  public static byte[][] toByteMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((byte[])input[0]).length;
+
+    byte[][] result = new byte[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      byte[] row = (byte[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+  public static short[][] toShortMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((short[])input[0]).length;
+
+    short[][] result = new short[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      short[] row = (short[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+
+  public static int[][] toIntMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((int[])input[0]).length;
+
+    int[][] result = new int[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      int[] row = (int[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+  public static long[][] toLongMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((long[])input[0]).length;
+
+    long[][] result = new long[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      long[] row = (long[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+  public static float[][] toFloatMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((float[])input[0]).length;
+
+    float[][] result = new float[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      float[] row = (float[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+  public static double[][] toDoubleMatrix(Object[] inputArray) {
+    Object[] input = flatten(inputArray).toArray();
+    int rows = input.length;
+    int cols = ((double[])input[0]).length;
+
+    double[][] result = new double[cols][rows];
+
+    for (int i = 0; i <  rows; i++) {
+      double[] row = (double[])input[i];
+      for (int j = 0; j < cols; j++) {
+        result[j][i] = row[j];
+      }
+    }
+    return result;
+  }
+
+  public static Stream<Object> flatten(Object[] array) {
+    return Arrays.stream(array)
+      .flatMap(o -> o instanceof Object[]? flatten((Object[])o): Stream.of(o));
+  }

Review comment:
       That's neat




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to