Hi Shawn,

Thank you for taking the time to reply to my email.  I appreciate the
links, but the Apache Arrow docs were what I referenced in my first
attempts.  I then used this question
<https://stackoverflow.com/questions/70082247/how-to-use-apache-arrow-to-write-files-in-parquet-format-on-windows-using-c?noredirect=1&lq=1>
on Stack Overflow, as the author gave significantly more detail about how
they made their build. For example, after running cmake, you need to build
INSTALL.vcxproj in Visual Studio to get the header files and .lib files
installed to your local directory.  I had no idea this was necessary, as it
was never mentioned on the Apache website and this is the first time I have
ever needed to build a C++ library. Though the Apache website caters to
many different types of users in many different environments, it was
certainly not written for someone who has a background in coding but not
in software development.

I have found the Python documentation for pyarrow perfectly serviceable.
Python is my language of choice (versus, mostly, R, MATLAB, Fortran, or
Visual Basic), and using the pyarrow package in the past is how I know I
would like to save my data in the Parquet format. I absolutely would use
Python for this project, but I need to read a file, organize the data, then
save it as Parquet. My preliminary checks showed that C++ was about 5x
faster than Python (admittedly, this was for writing a CSV file, not
Parquet, but I can only assume Parquet would be similar). I have hundreds
of terabytes of data to parse, so a 5x difference in speed is a deal
breaker.

Thank you again,
David

On Sun, Mar 6, 2022 at 11:48 PM Shawn Zeng <[email protected]> wrote:

> I guess you can refer to the docs at
> https://arrow.apache.org/docs/cpp/build_system.html and
> https://arrow.apache.org/docs/cpp/parquet.html.
>
> When I first used Arrow, I also felt the C++ docs were not as comprehensive
> as the Python ones. You can also use pyarrow, since it is just a wrapper
> around the C++ implementation. It is much simpler and the performance is
> almost the same.
>
> On Fri, Mar 4, 2022 at 4:54 PM David Griffin <[email protected]>
> wrote:
>
>> I hope this email finds you well and that this is the proper address for
>> my question. First off, let me say I am not a programmer by training and
>> have very little experience in C++. However, it is the best option for what
>> I'm doing, which is ultimately writing data in Parquet format. I can save
>> my data as CSV right now no problem, but I cannot get even the simplest
>> example code for arrow and/or parquet to work for me.  And unfortunately, I
>> have no idea what I'm doing when it comes to troubleshooting a manual build
>> of libraries in C++.  I've tried asking Stack Overflow and Reddit, but so
>> far, I haven't gotten any responses.  I've read as much as I could find on
>> those sites and the general internet at large, but I still can't figure out
>> how to properly access Apache libraries in my own C++ project.
>>
>> I'd be happy to go through everything I've done and the issues I am
>> running into, but I was wondering first if there was any location online
>> that walks completely through the setup of the Apache Arrow and Apache
>> Parquet libraries (ideally on Windows 10, but I can make an Ubuntu
>> partition if necessary) to see if there is something I missed when I did it
>> myself? Even just a nudge in the right direction would be appreciated.
>>
>> Thank you for taking the time to read my email and awesome job on this
>> technology.
>>
>> -David
>>
