HTTP/2 is not a magic 'make things faster' button (and due to head-of-line blocking, it can be slower than HTTP/1 in some circumstances!), plus you can use HTTP/2 outside of Flight/gRPC. (Or you could just use gRPC itself, though by default gRPC is going to force some extra copies on you.)
Flight encourages parallelization and separation of control/data, but of course you can implement those things yourself. I would still encourage you to try Arrow/Flight, especially as many frameworks in this space do interoperate with Arrow, but I'm not sure I'd treat Arrow as just a way to accelerate your network transfers.

On Fri, Mar 3, 2023, at 16:33, Vilayannur Sitaraman wrote:
> Hi David,
> I was of the belief that gRPC which uses http2 as transport is more efficient. Plus the benefits listed as pasted below for Arrow Flight?
> “
> We wanted Flight to enable systems to create horizontally scalable data services without having to deal with such bottlenecks. A client request to a dataset using the GetFlightInfo RPC returns a list of *endpoints*, each of which contains a server location and a *ticket* to send that server in a DoGet request to obtain a part of the full dataset. To get access to the entire dataset, all of the endpoints must be consumed. While Flight streams are not necessarily ordered, we provide for application-defined metadata which can be used to serialize ordering information.
> This multiple-endpoint pattern has a number of benefits:
> • Endpoints can be read by clients in parallel.
> • The service that serves the GetFlightInfo “planning” request can delegate work to sibling services to take advantage of data locality or simply to help with load balancing.
> • Nodes in a distributed cluster can take on different roles. For example, a subset of nodes might be responsible for planning queries while other nodes exclusively fulfill data stream (DoGet or DoPut) requests.
>
> ”
>
> *From: *David Li <[email protected]>
> *Date: *Friday, March 3, 2023 at 5:15 AM
> *To: *dl <[email protected]>
> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large volumes of unstructured text
> ***** EXTERNAL EMAIL *****
> If you are always just going to convert back to string, then I don't see why you wouldn't just use HTTP.
>
> On Thu, Mar 2, 2023, at 21:50, Vilayannur Sitaraman wrote:
>> I am primarily concerned about the efficient transfer of such texts across machines and across different programming languages… I am comfortable treating a text chunk as a VarCharVector with the textual content like below, and which I can then transfer to my NLP module, for further processing. But I want to get expert opinion on if this is the right way to handle this requirement. Or are there more efficient ways of doing the transfer than converting to arrow first, doing the transfer and then converting back to string for further processing.
>>
>> Thanks for your thoughts and considered opinion on this.
>>
>> VarCharVector stateVector = (VarCharVector) vectorSchemaRoot.getVector("state");
>> stateVector.allocateNew(textlines.size());
>> int k=0;
>> for ( String thisStr: textlines) {
>>   //nameVector.set(i, stateStr.getBytes());
>>   stateVector.set(k, thisStr.getBytes());
>>   k++;
>> }
>> //System.out.println("i in state is " + i + " " + stateStr);
>> //vectorSchemaRoot.setRowCount(i+1);
>> vectorSchemaRoot.setRowCount(textlines.size());
>> clientStreamListener.start(vectorSchemaRoot);
>> clientStreamListener.putNext();
>> clientStreamListener.completed();
>> System.out.println(vectorSchemaRoot.getRowCount());
>>
>> Sitaraman
>>
>> *From: *David Li <[email protected]>
>> *Date: *Thursday, March 2, 2023 at 6:03 PM
>> *To: *dl <[email protected]>
>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large volumes of unstructured text
>>
>> NLP is not something I'm familiar with. If your analysis works with Arrow or Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, xgboost) then it would likely benefit you to use Arrow up front instead of converting the data down the line. (For example, HuggingFace datasets use Arrow partly for its interoperability with other tools [1].)
>>
>> [1]: https://huggingface.co/docs/datasets/about_arrow
>>
>> On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:
>>> Hi David,
>>>
>>> Thanks for the questions…
>>>
>>> Are these two processes on the same machine:
>>> No, two different processes on different machines
>>>
>>> What exactly is the unstructured text
>>> The text is the textual content of normal documents that enterprises have such as pdf docx files. I can split these into chunks before transferring if needed.
>>>
>>> What is the python side planning to do:
>>> Analyze and run ML models such as NLP on the text.
>>>
>>> Sitaraman
>>>
>>> *From: *David Li <[email protected]>
>>> *Date: *Thursday, March 2, 2023 at 5:33 PM
>>> *To: *dl <[email protected]>
>>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large volumes of unstructured text
>>>
>>> Possibly, but more details might help. Are these two processes on the same machine, two components in the same process, two processes on different machines? What exactly is the unstructured text - does it at least fit into a column of data, or is it literally just a stream of text with no further structure? What is the Python side planning to do with the text (for instance, do you want to further analyze it with something like Pandas)?
>>>
>>> On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:
>>>> Hi,
>>>>
>>>> My use case is the need to efficiently transport large volumes of unstructured text from a module in Java to a module in Python with possibly a massaging of the docs before transport. Is Arrow Flight/Arrow the right choice for this? Why or why not? Any advice appreciated.
>>>>
>>>> Thanks
>>>>
>>>> Sitaraman
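
For anyone reading the archive later: the point above that parallelization and the separation of control and data can be implemented without Flight can be sketched roughly as follows. This is only an illustration using the Java standard library, not Flight's API and not anything proposed in the thread. The `ParallelEndpointReader` class, the `Endpoint` type, and the `fetcher` function are all hypothetical names; the fetcher stands in for whatever transport is actually used (a plain HTTP GET, a Flight DoGet, and so on). A "planning" step is assumed to have already produced the list of endpoints.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class ParallelEndpointReader {

    /** Hypothetical endpoint descriptor: a server location plus an opaque ticket naming one chunk. */
    public static final class Endpoint {
        public final String location;
        public final String ticket;

        public Endpoint(String location, String ticket) {
            this.location = location;
            this.ticket = ticket;
        }
    }

    /**
     * Fetches every endpoint concurrently and returns the decoded text chunks
     * in the same order as the endpoint list. The fetcher is supplied by the
     * caller and is an assumption of this sketch, not a real library call.
     */
    public static List<String> readAll(List<Endpoint> endpoints,
                                       Function<Endpoint, byte[]> fetcher,
                                       int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit all fetches first so they run in parallel.
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (Endpoint endpoint : endpoints) {
                futures.add(CompletableFuture.supplyAsync(
                        // Decode explicitly as UTF-8 rather than the platform default charset.
                        () -> new String(fetcher.apply(endpoint), StandardCharsets.UTF_8),
                        pool));
            }
            // Collect results in submission order; join() rethrows fetch failures.
            List<String> chunks = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                chunks.add(future.join());
            }
            return chunks;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the result of a planning request.
        List<Endpoint> plan = Arrays.asList(
                new Endpoint("host-a", "doc-1"),
                new Endpoint("host-b", "doc-2"));
        // Fake fetcher that fabricates a payload instead of touching the network.
        Function<Endpoint, byte[]> fakeFetcher =
                e -> ("text of " + e.ticket + " from " + e.location)
                        .getBytes(StandardCharsets.UTF_8);
        for (String chunk : readAll(plan, fakeFetcher, 2)) {
            System.out.println(chunk);
        }
    }
}
```

The sketch only covers the client side of the pattern. Whether it beats a single stream depends on the servers, the network, and the chunk sizes, so it is worth measuring before committing to either design.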
