Hi Shawn,

Thank you for taking the time to reply to my email. I appreciate the links, but the Apache Arrow docs were what I referenced in my first attempts. I then used this question <https://stackoverflow.com/questions/70082247/how-to-use-apache-arrow-to-write-files-in-parquet-format-on-windows-using-c?noredirect=1&lq=1> on Stack Overflow, as the author gave significantly more detail about how they made their build. For example, after running cmake, you need to build INSTALL.vcxproj in Visual Studio to get the header files and .lib files installed to your local directory. I had no idea this was necessary, as it was never mentioned on the Apache website and this is the first time I have ever needed to build a C++ library. Though the Apache website caters to many types of users in many different environments, it was certainly not written for someone who has a background in coding but not in software development.

I have found the Python documentation for pyarrow to be perfectly serviceable. Python is my language of choice (versus mostly R, Matlab, Fortran, or Visual Basic), and using the pyarrow package in the past is how I know I would like to save my data in the Parquet format. I absolutely would use Python for this project, but I need to read a file, organize the data, then save it as Parquet. My preliminary checks showed that C++ was about 5x faster than Python (admittedly, this was for writing a CSV file, not Parquet, but I can only assume that Parquet would be similar). I have hundreds of terabytes of data to parse, so a 5x slowdown is a deal breaker.

Thank you again,
David

On Sun, Mar 6, 2022 at 11:48 PM Shawn Zeng <[email protected]> wrote:

> I guess you can refer to the docs at
> https://arrow.apache.org/docs/cpp/build_system.html and
> https://arrow.apache.org/docs/cpp/parquet.html.
>
> When I first use Arrow, I do also feel the C++ doc is not as comprehensive
> as the Python one. You can also use pyarrow since it is just a wrapper on
> the C++ implementation. It is much simpler and the performance is almost
> the same.
>
> On Fri, Mar 4, 2022 at 4:54 PM David Griffin <[email protected]>
> wrote:
>
>> I hope this email finds you well and that this is the proper address for
>> my question. First off, let me say I am not a programmer by training and
>> have very little experience in C++. However, it is the best option for what
>> I'm doing, which is ultimately writing data in Parquet format. I can save
>> my data as CSV right now no problem, but I cannot get even the simplest
>> example code for arrow and/or parquet to work for me. And unfortunately, I
>> have no idea what I'm doing when it comes to troubleshooting a manual build
>> of libraries in C++. I've tried asking Stack Overflow and Reddit, but so
>> far, I haven't gotten any responses. I've read as much as I could find on
>> those sites and the general internet at large, but I still can't figure out
>> how to properly access Apache libraries in my own C++ project.
>>
>> I'd be happy to go through everything I've done and the issues I am
>> running into, but I was wondering first if there was any location online
>> that walks completely through the set up of the Apache Arrow and Apache
>> Parquet libraries (ideally in Windows 10, but I can make an Ubuntu
>> partition if necessary) to see if there is something I missed when I did it
>> myself? Even just a nudge in the right direction would be appreciated.
>>
>> Thank you for taking the time to read my email and awesome job on this
>> technology.
>>
>> -David
