NLP is not something I'm familiar with. If your analysis works with Arrow or Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, XGBoost), then you would likely benefit from using Arrow up front instead of converting the data down the line. (For example, Hugging Face Datasets uses Arrow partly for its interoperability with other tools [1].)
[1]: https://huggingface.co/docs/datasets/about_arrow

On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:
> Hi David,
> Thanks for the questions…
> Are these two processes on the same machine:
> No, two different processes on different machines
>
> What exactly is the unstructured text
> The text is the textual content of normal documents that enterprises have
> such as pdf docx files. I can split these into chunks before transferring if
> needed.
>
> What is the python side planning to do:
> Analyze and run ML models such as NLP on the text.
> Sitaraman
> *From: *David Li <[email protected]>
> *Date: *Thursday, March 2, 2023 at 5:33 PM
> *To: *dl <[email protected]>
> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large
> volumes of unstructured text
> ***** EXTERNAL EMAIL *****
> Possibly, but more details might help. Are these two processes on the same
> machine, two components in the same process, two processes on different
> machines? What exactly is the unstructured text - does it at least fit into a
> column of data, or is it literally just a stream of text with no further
> structure? What is the Python side planning to do with the text (for
> instance, do you want to further analyze it with something like Pandas)?
>
> On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:
>> Hi,
>>
>> My use case is the need to efficiently transport large volumes of
>> unstructured text from a module in Java to a module in Python with possibly
>> a massaging of the docs before transport. Is Arrow Flight/Arrow the right
>> choice for this? Why Why not? Any advice appreciated.
>>
>> Thanks
>>
>> Sitaraman
>>
>
