Re: Is ArrowFlight/Arrow the right choice for transporting large volumes of unstructured text

David Li Fri, 03 Mar 2023 13:38:25 -0800

HTTP/2 is not a magic 'make things faster' button (and due to head-of-line 
blocking, it can be slower than HTTP/1 in some circumstances!), plus you can 
use HTTP/2 outside of Flight/gRPC. (Or you could just use gRPC itself, though 
by default gRPC is going to force some extra copies on you.)
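
For comparison, here is a minimal sketch of the plain-HTTP baseline using only the JDK. The class name, the /text path, and the byte-count reply are hypothetical choices made for illustration; nothing here is part of Flight or gRPC. The client is pinned to HTTP/1.1 because the JDK's built-in test server does not speak HTTP/2; against a server that does, HttpClient.Version.HTTP_2 could be requested instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: ship a text chunk over plain HTTP, no Arrow involved.
class PlainHttpTextSketch {

    // Stand-in receiver: replies with the number of UTF-8 bytes it read.
    static HttpServer startReceiver() {
        try {
            // Port 0 asks the OS for any free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/text", exchange -> {
                byte[] body = exchange.getRequestBody().readAllBytes();
                byte[] reply = Integer.toString(body.length).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                exchange.getResponseBody().write(reply);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sender: POST the text as UTF-8 and return the receiver's reply.
    static String sendText(int port, String text) {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/text"))
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startReceiver();
        try {
            System.out.println(sendText(server.getAddress().getPort(), "some document text"));
        } finally {
            server.stop(0);
        }
    }
}
```

If the receiver only ever needs the string back, a baseline like this is what an Arrow-based transfer would have to beat.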


Flight encourages parallelization and separation of control/data, but of course 
you can implement those things yourself. 
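
As a rough illustration of implementing that pattern yourself, the sketch below separates a "control" step that plans partitions from a "data" step that fetches them in parallel. All names (EndpointFetchSketch, Endpoint, plan, fetchAll) are hypothetical and this is not the Flight API; the fetch function stands in for whatever transport you choose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of the multiple-endpoint pattern; not the Flight API.
class EndpointFetchSketch {

    // Stand-in for an endpoint: a server location plus an opaque ticket.
    static final class Endpoint {
        final String location;
        final String ticket;

        Endpoint(String location, String ticket) {
            this.location = location;
            this.ticket = ticket;
        }
    }

    // Control step: the planner describes where each part of the data lives.
    static List<Endpoint> plan(int partitions) {
        List<Endpoint> endpoints = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            endpoints.add(new Endpoint("server-" + i, "part-" + i));
        }
        return endpoints;
    }

    // Data step: fetch all endpoints concurrently, returning results in plan order.
    static List<String> fetchAll(List<Endpoint> endpoints, Function<Endpoint, String> fetch) {
        int threads = Math.max(1, Math.min(endpoints.size(), 4));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (Endpoint endpoint : endpoints) {
                Callable<String> task = () -> fetch.apply(endpoint);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> chunks = fetchAll(plan(3),
                e -> "text from " + e.location + " for " + e.ticket);
        chunks.forEach(System.out::println);
    }
}
```

Collecting the futures in submission order keeps the planner's ordering even though the fetches themselves may finish in any order.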

I would still encourage you to try Arrow/Flight, especially as many frameworks 
in this space do interoperate with Arrow, but I'm not sure I'd treat Arrow as 
just a way to accelerate your network transfers. 

On Fri, Mar 3, 2023, at 16:33, Vilayannur Sitaraman wrote:
> Hi David,
>    I was under the impression that gRPC, which uses HTTP/2 as its transport, 
> is more efficient.  And what about the benefits listed for Arrow Flight, 
> pasted below?
> “
> We wanted Flight to enable systems to create horizontally scalable data 
> services without having to deal with such bottlenecks. A client request to a 
> dataset using the GetFlightInfo RPC returns a list of *endpoints*, each of 
> which contains a server location and a *ticket* to send that server in a 
> DoGet request to obtain a part of the full dataset. To get access to the 
> entire dataset, all of the endpoints must be consumed. While Flight streams 
> are not necessarily ordered, we provide for application-defined metadata 
> which can be used to serialize ordering information.
> This multiple-endpoint pattern has a number of benefits:
>  • Endpoints can be read by clients in parallel.
>  • The service that serves the GetFlightInfo “planning” request can delegate 
> work to sibling services to take advantage of data locality or simply to help 
> with load balancing.
>  • Nodes in a distributed cluster can take on different roles. For example, a 
> subset of nodes might be responsible for planning queries while other nodes 
> exclusively fulfill data stream (DoGet or DoPut) requests.
> ”
>
> *From: *David Li <[email protected]>
> *Date: *Friday, March 3, 2023 at 5:15 AM
> *To: *dl <[email protected]>
> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large 
> volumes of unstructured text
> If you are always just going to convert back to string, then I don't see why 
> you wouldn't just use HTTP.
>  
> On Thu, Mar 2, 2023, at 21:50, Vilayannur Sitaraman wrote:
>> I am primarily concerned about the efficient transfer of such texts across 
>> machines and across different programming languages. I am comfortable 
>> treating a text chunk as a VarCharVector holding the textual content, as 
>> shown below, which I can then transfer to my NLP module for further 
>> processing.  But I would like an expert opinion on whether this is the right 
>> way to handle this requirement, or whether there are more efficient ways of 
>> doing the transfer than converting to Arrow first, transferring, and then 
>> converting back to a string for further processing.
>> 
>> Thanks for your thoughts and considered opinion on this.
>> 
>> // Needs: import java.nio.charset.StandardCharsets;
>> // Copy each text chunk into the "state" column as UTF-8 bytes.
>> VarCharVector stateVector = (VarCharVector) vectorSchemaRoot.getVector("state");
>> stateVector.allocateNew(textlines.size());
>> int k = 0;
>> for (String thisStr : textlines) {
>>   // setSafe grows the data buffer as needed; plain set() can overflow it.
>>   stateVector.setSafe(k, thisStr.getBytes(StandardCharsets.UTF_8));
>>   k++;
>> }
>> vectorSchemaRoot.setRowCount(textlines.size());
>> // Send the batch on the put stream, then signal that the stream is done.
>> clientStreamListener.start(vectorSchemaRoot);
>> clientStreamListener.putNext();
>> clientStreamListener.completed();
>> System.out.println(vectorSchemaRoot.getRowCount());
>> 
>> Sitaraman
>> 
>> *From: *David Li <[email protected]>
>> *Date: *Thursday, March 2, 2023 at 6:03 PM
>> *To: *dl <[email protected]>
>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large 
>> volumes of unstructured text
>> 
>> NLP is not something I'm familiar with. If your analysis works with Arrow or 
>> Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, xgboost) then it 
>> would likely benefit you to use Arrow up front instead of converting the 
>> data down the line. (For example, HuggingFace datasets use Arrow partly for 
>> its interoperability with other tools [1].) 
>> 
>>  
>> 
>> [1]: https://huggingface.co/docs/datasets/about_arrow
>> 
>>  
>> 
>> On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:
>> 
>>> Hi David,
>>> 
>>> Thanks for the questions…
>>> 
>>> Are these two processes on the same machine:
>>> 
>>> No, two different processes on different machines
>>> 
>>>  
>>> 
>>> What exactly is the unstructured text
>>> 
>>> The text is the textual content of ordinary enterprise documents, such as 
>>> PDF and DOCX files.  I can split these into chunks before transferring 
>>> if needed. 
>>> 
>>>  
>>> 
>>> What is the Python side planning to do:
>>> 
>>> Analyze and run ML models such as NLP on the text.
>>> 
>>> Sitaraman
>>> 
>>> *From: *David Li <[email protected]>
>>> *Date: *Thursday, March 2, 2023 at 5:33 PM
>>> *To: *dl <[email protected]>
>>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large 
>>> volumes of unstructured text
>>> 
>>> Possibly, but more details might help. Are these two processes on the same 
>>> machine, two components in the same process, two processes on different 
>>> machines? What exactly is the unstructured text - does it at least fit into 
>>> a column of data, or is it literally just a stream of text with no further 
>>> structure? What is the Python side planning to do with the text (for 
>>> instance, do you want to further analyze it with something like Pandas)?
>>> 
>>>  
>>> 
>>> On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:
>>> 
>>>> Hi,
>>>> 
>>>>   My use case is the need to efficiently transport large volumes of 
>>>> unstructured text from a module in Java to a module in Python, possibly 
>>>> with some preprocessing of the documents before transport. Is Arrow 
>>>> Flight/Arrow the right choice for this?  Why or why not?  Any advice 
>>>> appreciated.
>>>> 
>>>> Thanks
>>>> 
>>>> Sitaraman
>>>> 
>>>  
>>> 
>>  
>> 
>
