Re: Is ArrowFlight/Arrow the right choice for transporting large volumes of unstructured text

David Li Fri, 03 Mar 2023 05:14:15 -0800

If you are always just going to convert back to string, then I don't see why 
you wouldn't just use HTTP.
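
As a rough illustration of the plain-HTTP route, here is a minimal sketch using only the JDK (java.net.http on the sending side, and the JDK's built-in com.sun.net.httpserver as a throwaway stand-in for the receiver). The class name TextOverHttp, the /ingest path, and the newline-delimited body are all assumptions made up for this example, not anything from your setup; in practice the receiver would be whatever HTTP endpoint your Python service exposes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class TextOverHttp {

    /** Sends the chunks as one newline-delimited UTF-8 body; returns the HTTP status code. */
    public static int postChunks(HttpClient client, URI endpoint, List<String> chunks)
            throws IOException, InterruptedException {
        String body = String.join("\n", chunks);
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    /** Starts a local stand-in receiver, posts the chunks to it, and returns what it received. */
    public static String roundTrip(List<String> chunks) throws IOException, InterruptedException {
        AtomicReference<String> received = new AtomicReference<>("");
        // Port 0 asks the OS for any free port; "/ingest" is a hypothetical path.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/ingest", exchange -> {
            byte[] bytes = exchange.getRequestBody().readAllBytes();
            received.set(new String(bytes, StandardCharsets.UTF_8));
            exchange.sendResponseHeaders(204, -1); // 204 No Content, no response body
            exchange.close();
        });
        server.start();
        try {
            URI endpoint = URI.create(
                    "http://127.0.0.1:" + server.getAddress().getPort() + "/ingest");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            int status = postChunks(client, endpoint, chunks);
            if (status != 204) {
                throw new IOException("unexpected status " + status);
            }
            return received.get();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(List.of("first chunk", "second chunk")));
    }
}
```

Note that newline-delimited framing only works if the chunks themselves contain no newlines; for real documents you would probably want one request per chunk, or a framing such as JSON.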


On Thu, Mar 2, 2023, at 21:50, Vilayannur Sitaraman wrote:
> I am primarily concerned about the efficient transfer of such texts across 
> machines and across different programming languages. I am comfortable 
> treating a text chunk as a VarCharVector holding the textual content, as in 
> the code below, which I can then transfer to my NLP module for further 
> processing. But I would like an expert opinion on whether this is the right 
> way to handle this requirement. Or are there more efficient ways of doing the 
> transfer than converting to Arrow first, doing the transfer, and then 
> converting back to a string for further processing?
> Thanks for your thoughts and considered opinion on this.
> // Needs: import java.nio.charset.StandardCharsets;
> VarCharVector stateVector = (VarCharVector) vectorSchemaRoot.getVector("state");
> stateVector.allocateNew(textlines.size());
> int k = 0;
> for (String thisStr : textlines) {
>   // setSafe grows the data buffer as needed; encode explicitly as UTF-8
>   // rather than relying on the platform default charset.
>   stateVector.setSafe(k, thisStr.getBytes(StandardCharsets.UTF_8));
>   k++;
> }
> vectorSchemaRoot.setRowCount(textlines.size());
> clientStreamListener.start(vectorSchemaRoot);
> clientStreamListener.putNext();
> clientStreamListener.completed();
> System.out.println(vectorSchemaRoot.getRowCount());
> Sitaraman
> From: David Li <[email protected]>
> Date: Thursday, March 2, 2023 at 6:03 PM
> To: dl <[email protected]>
> Subject: Re: Is ArrowFlight/Arrow the right choice for transporting large 
> volumes of unstructured text
> NLP is not something I'm familiar with. If your analysis works with Arrow or 
> Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, xgboost) then it 
> would likely benefit you to use Arrow up front instead of converting the data 
> down the line. (For example, HuggingFace datasets use Arrow partly for its 
> interoperability with other tools [1].) 
>  
> [1]: https://huggingface.co/docs/datasets/about_arrow 
>  
> On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:
>> Hi David,
>> 
>> Thanks for the questions…
>> 
>> Are these two processes on the same machine:
>> 
>> No, two different processes on different machines
>> 
>>  
>> 
>> What exactly is the unstructured text
>> 
>> The text is the textual content of ordinary enterprise documents, such as 
>> PDF and DOCX files.  I can split these into chunks before transferring 
>> if needed. 
>> 
>>  
>> 
>> What is the python side planning to do:
>> 
>> Analyze and run ML models such as NLP on the text.
>> 
>> Sitaraman
>> 
>> From: David Li <[email protected]>
>> Date: Thursday, March 2, 2023 at 5:33 PM
>> To: dl <[email protected]>
>> Subject: Re: Is ArrowFlight/Arrow the right choice for transporting large 
>> volumes of unstructured text
>> 
>> Possibly, but more details might help. Are these two processes on the same 
>> machine, two components in the same process, two processes on different 
>> machines? What exactly is the unstructured text - does it at least fit into 
>> a column of data, or is it literally just a stream of text with no further 
>> structure? What is the Python side planning to do with the text (for 
>> instance, do you want to further analyze it with something like Pandas)?
>> 
>>  
>> 
>> On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:
>> 
>>> Hi,
>>> 
>>>   My use case is the need to efficiently transport large volumes of 
>>> unstructured text from a module in Java to a module in Python, possibly 
>>> with some massaging of the docs before transport. Is Arrow Flight/Arrow the 
>>> right choice for this?  Why or why not?  Any advice appreciated.
>>> 
>>> Thanks
>>> 
>>> Sitaraman
>>> 
>>  
>> 
>
