Re: Is ArrowFlight/Arrow the right choice for transporting large volumes of unstructured text

Vilayannur Sitaraman Thu, 02 Mar 2023 18:50:40 -0800

I am primarily concerned with transferring such text efficiently across 
machines and across programming languages. I am comfortable treating each 
text chunk as an entry in a VarCharVector, as in the code below, which I can 
then pass to my NLP module for further processing. But I would like an expert 
opinion on whether this is the right way to handle the requirement, or whether 
there are more efficient ways to do the transfer than converting to Arrow 
first, transferring, and then converting back to strings for further 
processing.
Thanks for your thoughts and considered opinion on this.
// Needs: import java.nio.charset.StandardCharsets;
VarCharVector stateVector = (VarCharVector) vectorSchemaRoot.getVector("state");
stateVector.allocateNew(textlines.size());
int k = 0;
for (String thisStr : textlines) {
  // setSafe grows the data buffer as needed; plain set() can fail on long text.
  // Encode explicitly as UTF-8 instead of relying on the platform default charset.
  stateVector.setSafe(k, thisStr.getBytes(StandardCharsets.UTF_8));
  k++;
}
vectorSchemaRoot.setRowCount(textlines.size());
// Send the populated root as one record batch, then close the stream.
clientStreamListener.start(vectorSchemaRoot);
clientStreamListener.putNext();
clientStreamListener.completed();
System.out.println(vectorSchemaRoot.getRowCount());
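
For a sense of what "converting to Arrow" costs for plain strings, here is a rough sketch of the variable-width layout a string column uses: one contiguous buffer of UTF-8 bytes plus an offsets array with one more entry than there are values. It uses only the JDK, not the Arrow library; the class and method names are made up for illustration, and the validity bitmap is left out on the assumption that there are no null values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class VarWidthLayoutSketch {
    /** Hypothetical holder for the two buffers; no validity bitmap (assumes no nulls). */
    static final class Column {
        final int[] offsets; // offsets[i]..offsets[i + 1] delimits value i in data
        final byte[] data;   // all values as UTF-8, concatenated back to back

        Column(int[] offsets, byte[] data) {
            this.offsets = offsets;
            this.data = data;
        }

        /** Decodes value i back into a Java String. */
        String get(int i) {
            int length = offsets[i + 1] - offsets[i];
            return new String(data, offsets[i], length, StandardCharsets.UTF_8);
        }
    }

    /** Encodes the strings into an offsets array and a single data buffer. */
    static Column encode(List<String> values) {
        int[] offsets = new int[values.size() + 1];
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (int i = 0; i < values.size(); i++) {
            byte[] utf8 = values.get(i).getBytes(StandardCharsets.UTF_8);
            data.write(utf8, 0, utf8.length);
            offsets[i + 1] = data.size(); // end of value i, start of value i + 1
        }
        return new Column(offsets, data.toByteArray());
    }

    public static void main(String[] args) {
        Column c = encode(Arrays.asList("h\u00e9llo", "", "world"));
        System.out.println(Arrays.toString(c.offsets)); // [0, 6, 6, 11]
        System.out.println(c.get(2));                   // world
    }
}
```

So for text the per-value work is essentially one UTF-8 encode on the way in and one decode on the way out; the bytes themselves are stored contiguously.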
Sitaraman
From: David Li <[email protected]>
Date: Thursday, March 2, 2023 at 6:03 PM
To: dl <[email protected]>
Subject: Re: Is ArrowFlight/Arrow the right choice for transporting large 
volumes of unstructured text
NLP is not something I'm familiar with. If your analysis works with Arrow or 
Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, xgboost) then it 
would likely benefit you to use Arrow up front instead of converting the data 
down the line. (For example, HuggingFace datasets use Arrow partly for its 
interoperability with other tools [1].)


[1]: https://huggingface.co/docs/datasets/about_arrow

On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:

Hi David,

Thanks for the questions…

Are these two processes on the same machine:

No, two different processes on different machines



What exactly is the unstructured text

The text is the textual content of ordinary enterprise documents, such as PDF 
and DOCX files. I can split these into chunks before transferring if needed.
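
The chunking mentioned above could look roughly like the following sketch, which splits extracted text into pieces of at most maxChars characters and prefers to break after whitespace. The class name, method name, and maxChars parameter are hypothetical, and lengths are counted in UTF-16 chars, so a window with no whitespace may be cut mid-word (or even between a surrogate pair).

```java
import java.util.ArrayList;
import java.util.List;

public class TextChunker {
    /**
     * Splits text into chunks of at most maxChars chars. When a cut is needed,
     * it is moved back to just after the last whitespace in the window, if any.
     * Concatenating the chunks in order reproduces the original text.
     */
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Walk back to the position right after the last whitespace.
                int ws = end;
                while (ws > start && !Character.isWhitespace(text.charAt(ws - 1))) {
                    ws--;
                }
                if (ws > start) {
                    end = ws; // break after whitespace instead of mid-word
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Prints three chunks: "aaa ", "bbb ", "ccc"
        for (String piece : chunk("aaa bbb ccc", 5)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

Each chunk would then become one entry in the string column, so the chunk size also controls how large each value in a batch gets.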



What is the python side planning to do:

Analyze and run ML models such as NLP on the text.

Sitaraman

From: David Li <[email protected]>
Date: Thursday, March 2, 2023 at 5:33 PM
To: dl <[email protected]>
Subject: Re: Is ArrowFlight/Arrow the right choice for transporting large 
volumes of unstructured text

Possibly, but more details might help. Are these two processes on the same 
machine, two components in the same process, two processes on different 
machines? What exactly is the unstructured text - does it at least fit into a 
column of data, or is it literally just a stream of text with no further 
structure? What is the Python side planning to do with the text (for instance, 
do you want to further analyze it with something like Pandas)?



On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:

Hi,

  My use case is the need to efficiently transport large volumes of 
unstructured text from a module in Java to a module in Python, possibly with 
some preprocessing of the documents before transport. Is Arrow Flight/Arrow 
the right choice for this? Why or why not? Any advice is appreciated.

Thanks

Sitaraman
