If you are always just going to convert back to string, then I don't see why
you wouldn't just use HTTP.
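
For what it's worth, here is a minimal sketch of the plain-HTTP route, under stated assumptions: the class name, the /ingest path, and the newline-joined payload are all made up for illustration, and the JDK's built-in HttpServer only stands in for whatever would actually receive the text on the Python side. It POSTs the text as explicit UTF-8 and checks what arrives.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class PlainHttpTextTransfer {

    /** POSTs the chunks as one UTF-8 text body and returns what the receiver decoded. */
    static String sendAndReceive(List<String> chunks) {
        AtomicReference<String> received = new AtomicReference<>();
        HttpServer server = null;
        try {
            // Stand-in receiver on an ephemeral loopback port (hypothetical /ingest endpoint).
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/ingest", exchange -> {
                byte[] body = exchange.getRequestBody().readAllBytes();
                received.set(new String(body, StandardCharsets.UTF_8));
                exchange.sendResponseHeaders(204, -1); // 204 No Content, no response body
                exchange.close();
            });
            server.start();

            // Assumption: chunks are simply joined with newlines for transport.
            String payload = String.join("\n", chunks);
            URI target = URI.create(
                    "http://127.0.0.1:" + server.getAddress().getPort() + "/ingest");
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(target)
                    .header("Content-Type", "text/plain; charset=utf-8")
                    .POST(HttpRequest.BodyPublishers.ofString(payload, StandardCharsets.UTF_8))
                    .build();

            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() != 204) {
                throw new IllegalStateException("Unexpected status " + response.statusCode());
            }
            // The handler stores the body before it responds, so the value is set by now.
            return received.get();
        } catch (IOException e) {
            throw new RuntimeException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAndReceive(List.of("first chunk", "second chunk")));
    }
}
```

There is no columnar encoding step on either end here, which is the point: if the receiver only ever wants a string, the bytes on the wire are already that string.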
On Thu, Mar 2, 2023, at 21:50, Vilayannur Sitaraman wrote:
> I am primarily concerned about the efficient transfer of such texts across
> machines and across different programming languages. I am comfortable
> treating a text chunk as a VarCharVector with the textual content, as below,
> which I can then transfer to my NLP module for further processing. But
> I want to get expert opinion on whether this is the right way to handle this
> requirement. Or are there more efficient ways of doing the transfer than
> converting to Arrow first, doing the transfer, and then converting back to
> string for further processing?
> Thanks for your thoughts and considered opinion on this.
> // Fill the "state" VarChar column with one text chunk per row.
> // Requires: import java.nio.charset.StandardCharsets;
> VarCharVector stateVector = (VarCharVector) vectorSchemaRoot.getVector("state");
> stateVector.allocateNew(textlines.size());
> int k = 0;
> for (String thisStr : textlines) {
>     // Arrow strings are UTF-8, so encode explicitly rather than with the
>     // platform default; setSafe grows the data buffer for long chunks.
>     stateVector.setSafe(k, thisStr.getBytes(StandardCharsets.UTF_8));
>     k++;
> }
> vectorSchemaRoot.setRowCount(textlines.size());
> // Send the batch over the Flight stream, then signal end of stream.
> clientStreamListener.start(vectorSchemaRoot);
> clientStreamListener.putNext();
> clientStreamListener.completed();
> System.out.println(vectorSchemaRoot.getRowCount());
> Sitaraman
> *From: *David Li <[email protected]>
> *Date: *Thursday, March 2, 2023 at 6:03 PM
> *To: *dl <[email protected]>
> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large
> volumes of unstructured text
> NLP is not something I'm familiar with. If your analysis works with Arrow or
> Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, xgboost) then it
> would likely benefit you to use Arrow up front instead of converting the data
> down the line. (For example, HuggingFace datasets use Arrow partly for its
> interoperability with other tools [1].)
>
> [1]: https://huggingface.co/docs/datasets/about_arrow
>
> On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:
>> Hi David,
>>
>> Thanks for the questions…
>>
>> Are these two processes on the same machine:
>>
>> No, two different processes on different machines
>>
>>
>>
>> What exactly is the unstructured text
>>
>> The text is the textual content of normal documents that enterprises have,
>> such as PDF and DOCX files. I can split these into chunks before transferring
>> if needed.
>>
>>
>>
>> What is the python side planning to do:
>>
>> Analyze and run ML models such as NLP on the text.
>>
>> Sitaraman
>>
>> *From: *David Li <[email protected]>
>> *Date: *Thursday, March 2, 2023 at 5:33 PM
>> *To: *dl <[email protected]>
>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large
>> volumes of unstructured text
>>
>> Possibly, but more details might help. Are these two processes on the same
>> machine, two components in the same process, two processes on different
>> machines? What exactly is the unstructured text - does it at least fit into
>> a column of data, or is it literally just a stream of text with no further
>> structure? What is the Python side planning to do with the text (for
>> instance, do you want to further analyze it with something like Pandas)?
>>
>>
>>
>> On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:
>>
>>> Hi,
>>>
>>> My use case is the need to efficiently transport large volumes of
>>> unstructured text from a module in Java to a module in Python with possibly
>>> a massaging of the docs before transport. Is Arrow Flight/Arrow the right
>>> choice for this? Why or why not? Any advice appreciated.
>>>
>>> Thanks
>>>
>>> Sitaraman
>>>
>>
>>
>