davisusanibar commented on PR #14203:
URL: https://github.com/apache/arrow/pull/14203#issuecomment-1260326584

   Tested on `Windows 10 Home`:
   
   1.- Download the new Dataset and C Data jars from 
https://github.com/ursacomputing/crossbow/releases/tag/actions-9be4b55dea-github-java-jars
   
   2.- Check the dependencies of the new DLLs with cygcheck:
   ````
   # Dataset DLL
   $ cygcheck.exe 'arrow_dataset_jni.dll'
     C:\Windows\system32\WINHTTP.dll
       C:\Windows\system32\ntdll.dll
       C:\Windows\system32\KERNELBASE.dll
     C:\Windows\system32\bcrypt.dll
     C:\Windows\system32\WININET.dll
       C:\Windows\system32\msvcrt.dll
     C:\Windows\system32\USERENV.dll
       C:\Windows\system32\RPCRT4.dll
     C:\Windows\system32\VERSION.dll
       C:\Windows\system32\KERNEL32.dll
     C:\Windows\system32\WS2_32.dll
     C:\Windows\system32\SHELL32.dll
       C:\Windows\system32\msvcp_win.dll
       C:\Windows\system32\USER32.dll
         C:\Windows\system32\win32u.dll
         C:\Windows\system32\GDI32.dll
     C:\Windows\system32\ole32.dll
       C:\Windows\system32\combase.dll
     C:\Windows\system32\ADVAPI32.dll
       C:\Windows\system32\SECHOST.dll
     C:\Windows\system32\MSVCP140.dll
       C:\Windows\system32\VCRUNTIME140.dll
       C:\Windows\system32\VCRUNTIME140_1.dll
   
   # C Data Interface DLL
   $ cygcheck.exe 'arrow_cdata_jni.dll'
     C:\Windows\system32\MSVCP140.dll
       C:\Windows\system32\VCRUNTIME140.dll
         C:\Windows\system32\KERNEL32.dll
           C:\Windows\system32\ntdll.dll
           C:\Windows\system32\KERNELBASE.dll
       C:\Windows\system32\VCRUNTIME140_1.dll
   ````
   If cygcheck reports errors (for example, missing DLLs), try installing the latest supported Microsoft Visual C++ Redistributable: 
https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170
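   
   Since the DLLs ship inside the jars (see step 5), it can help to confirm they are present before running cygcheck on them. A minimal sketch, assuming only the JDK's `java.util.jar` API: it lists every `.dll` entry in a jar. The class name `ListJarDlls` and the command-line argument are hypothetical, and no particular entry path inside the jar is assumed.
   
   ```java
   import java.io.IOException;
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import java.util.List;
   import java.util.jar.JarFile;
   import java.util.stream.Collectors;
   
   public class ListJarDlls {
       // Returns the names of all jar entries whose name ends with ".dll".
       static List<String> listDlls(Path jar) throws IOException {
           try (JarFile jarFile = new JarFile(jar.toFile())) {
               return jarFile.stream()
                   .map(entry -> entry.getName())
                   .filter(name -> name.toLowerCase().endsWith(".dll"))
                   .collect(Collectors.toList());
           }
       }
   
       public static void main(String[] args) throws IOException {
           // Hypothetical usage: pass the path of the downloaded jar as the first argument.
           for (String name : listDlls(Paths.get(args[0]))) {
               System.out.println(name);
           }
       }
   }
   ```
   
   Running it against the Dataset jar should print an entry ending in `arrow_dataset_jni.dll` if the jar was built as described.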
   
   3.- Install the new Dataset and C Data jars into the local Maven repository:
   ````
   # install dataset manually
   mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-dataset-10.0.0-SNAPSHOT.pom" -DgroupId="org.apache.arrow" -DartifactId="arrow-dataset" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="pom"
   mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-dataset-10.0.0-SNAPSHOT.jar" -DgroupId="org.apache.arrow" -DartifactId="arrow-dataset" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="jar"
   # install c data interface manually
   mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-c-data-10.0.0-SNAPSHOT.pom" -DgroupId="org.apache.arrow" -DartifactId="arrow-c-data" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="pom"
   mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-c-data-10.0.0-SNAPSHOT.jar" -DgroupId="org.apache.arrow" -DartifactId="arrow-c-data" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="jar"
   ````
   
   4.- Add the new Dataset and C Data Interface dependencies to your project (Maven/Gradle)
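   
   For Maven, this step might look like the following `pom.xml` fragment. It is a sketch that assumes the artifacts were installed with the coordinates used in step 3; a Gradle build would declare the same coordinates in its own syntax.
   
   ```xml
   <!-- Coordinates taken from the install:install-file commands in step 3 -->
   <dependencies>
     <dependency>
       <groupId>org.apache.arrow</groupId>
       <artifactId>arrow-dataset</artifactId>
       <version>10.0.0-SNAPSHOT</version>
     </dependency>
     <dependency>
       <groupId>org.apache.arrow</groupId>
       <artifactId>arrow-c-data</artifactId>
       <version>10.0.0-SNAPSHOT</version>
     </dependency>
   </dependencies>
   ```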
   
   5.- Create a Dataset with the new Dataset jar (which contains arrow_dataset_jni.dll) and read RecordBatches with the new C Data Interface jar (which contains arrow_cdata_jni.dll):
   ````
   import org.apache.arrow.dataset.file.FileFormat;
   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
   import org.apache.arrow.dataset.jni.NativeMemoryPool;
   import org.apache.arrow.dataset.scanner.ScanOptions;
   import org.apache.arrow.dataset.scanner.Scanner;
   import org.apache.arrow.dataset.source.Dataset;
   import org.apache.arrow.dataset.source.DatasetFactory;
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.ipc.ArrowReader;
   
   import java.io.IOException;
   
   public class Recipe {
       public static void main(String[] args) {
           // File at: https://github.com/apache/arrow-cookbook/blob/main/java/thirdpartydeps/parquetfiles/data1.parquet
           String uri = "file:///C:\\Users\\dsusanibar\\IdeaProjects\\win-cookbooks\\src\\main\\resources\\files\\data1.parquet";
           ScanOptions options = new ScanOptions(/*batchSize*/ 5);
           // Batch counter; a one-element array so the lambda below can update it.
           final int[] count = {1};
           try (
               BufferAllocator allocator = new RootAllocator();
               DatasetFactory datasetFactory = new FileSystemDatasetFactory(
                   allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
               Dataset dataset = datasetFactory.finish();
               Scanner scanner = dataset.newScan(options)
           ) {
               scanner.scan().forEach(scanTask -> {
                   try (ArrowReader reader = scanTask.execute()) {
                       // The reader owns this root and closes it when the reader is closed.
                       VectorSchemaRoot root = reader.getVectorSchemaRoot();
                       while (reader.loadNextBatch()) {
                           System.out.println("Number of rows per batch[" + count[0]++ + "]: " + root.getRowCount());
                           System.out.println(root.contentToTSVString());
                       }
                   } catch (IOException e) {
                       e.printStackTrace();
                   }
               });
           } catch (Exception e) {
               e.printStackTrace();
           }
       }
   }
   
   Result:
   Number of rows per batch[1]: 3
   id   name
   1    David
   2    Gladis
   3    Juan
   ````
   Thanks a lot @kou 

