davisusanibar commented on PR #14203:
URL: https://github.com/apache/arrow/pull/14203#issuecomment-1260326584

Tested on `Windows 10 Home`:

1.- Download the new Dataset / C Data jars locally from https://github.com/ursacomputing/crossbow/releases/tag/actions-9be4b55dea-github-java-jars

2.- Test the newly created DLLs:

````
# Dataset DLL
$ cygcheck.exe 'arrow_dataset_jni.dll'
C:\Windows\system32\WINHTTP.dll
C:\Windows\system32\ntdll.dll
C:\Windows\system32\KERNELBASE.dll
C:\Windows\system32\bcrypt.dll
C:\Windows\system32\WININET.dll
C:\Windows\system32\msvcrt.dll
C:\Windows\system32\USERENV.dll
C:\Windows\system32\RPCRT4.dll
C:\Windows\system32\VERSION.dll
C:\Windows\system32\KERNEL32.dll
C:\Windows\system32\WS2_32.dll
C:\Windows\system32\SHELL32.dll
C:\Windows\system32\msvcp_win.dll
C:\Windows\system32\USER32.dll
C:\Windows\system32\win32u.dll
C:\Windows\system32\GDI32.dll
C:\Windows\system32\ole32.dll
C:\Windows\system32\combase.dll
C:\Windows\system32\ADVAPI32.dll
C:\Windows\system32\SECHOST.dll
C:\Windows\system32\MSVCP140.dll
C:\Windows\system32\VCRUNTIME140.dll
C:\Windows\system32\VCRUNTIME140_1.dll

# C Data Interface DLL
$ cygcheck.exe 'arrow_cdata_jni.dll'
C:\Windows\system32\MSVCP140.dll
C:\Windows\system32\VCRUNTIME140.dll
C:\Windows\system32\KERNEL32.dll
C:\Windows\system32\ntdll.dll
C:\Windows\system32\KERNELBASE.dll
C:\Windows\system32\VCRUNTIME140_1.dll
````

If you see errors, try installing the latest supported Visual C++ Redistributable: https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170

3.- Install the new Dataset / C Data jars locally:

````
# install dataset manually
mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-dataset-10.0.0-SNAPSHOT.pom" -DgroupId="org.apache.arrow" -DartifactId="arrow-dataset" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="pom"
mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-dataset-10.0.0-SNAPSHOT.jar" -DgroupId="org.apache.arrow" -DartifactId="arrow-dataset" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="jar"

# install c data interface manually
mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-c-data-10.0.0-SNAPSHOT.pom" -DgroupId="org.apache.arrow" -DartifactId="arrow-c-data" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="pom"
mvn install:install-file -Dfile="C:\Users\dsusanibar\IdeaProjects\win-cookbooks\src\main\resources\files\arrow-c-data-10.0.0-SNAPSHOT.jar" -DgroupId="org.apache.arrow" -DartifactId="arrow-c-data" -Dversion="10.0.0-SNAPSHOT" -Dpackaging="jar"
````

4.- Add the new Dataset / C Data Interface dependencies to your project (Maven/Gradle)

5.- Create a Dataset with the new Dataset jar, which contains the DLL arrow_dataset_jni.dll, and read RecordBatches with the new C Data Interface jar, which contains the DLL arrow_cdata_jni.dll:

````
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

import java.io.IOException;
import java.net.URISyntaxException;

public class Recipe {
    public static void main(String[] args) throws URISyntaxException {
        // File at: https://github.com/apache/arrow-cookbook/blob/main/java/thirdpartydeps/parquetfiles/data1.parquet
        String uri = "file:///C:\\Users\\dsusanibar\\IdeaProjects\\win-cookbooks\\src\\main\\resources\\files\\data1.parquet";
        ScanOptions options = new ScanOptions(/*batchSize*/ 5);
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            Dataset dataset = datasetFactory.finish();
            Scanner scanner = dataset.newScan(options)
        ) {
            scanner.scan().forEach(scanTask -> {
                try (ArrowReader reader = scanTask.execute()) {
                    // Batch counter lives outside the loop so it advances with each batch
                    int count = 1;
                    while (reader.loadNextBatch()) {
                        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
                            System.out.println("Number of rows per batch[" + count++ + "]: " + root.getRowCount());
                            System.out.println(root.contentToTSVString());
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Result:

Number of rows per batch[1]: 3
id    name
1    David
2    Gladis
3    Juan
````

Thanks a lot @kou

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
