t829702 edited a comment on pull request #2035:
URL: https://github.com/apache/arrow/pull/2035#issuecomment-695107985


   > There are a few strategies to convert arbitrary JavaScript types into Arrow tables, and the strategy you pick depends on your needs. They all use the Builder classes under the hood, and generally follow this pattern:
   
   Thanks @trxcllnt for all that information. I am now able to write a small script pipelining the building blocks. It looks like this, and I still have a few questions:
   
   ```js
   const fs = require('fs');
   const { promisify } = require('util');
   const pipeline = promisify(require('stream').pipeline);
   const arrow = require('apache-arrow');

   const [inputFile, outputFile] = process.argv.slice(2);

   const userNameDict = new arrow.Dictionary(new arrow.Utf8(), new arrow.Int32());
   // const senderNameDict = new arrow.Dictionary(new arrow.Utf8(), new arrow.Int32());
   const lineObjStruct = new arrow.Struct([
     arrow.Field.new({ name: 'date', type: new arrow.DateMillisecond() }),
     arrow.Field.new({ name: 'articleId', type: new arrow.Uint32() }),
     arrow.Field.new({ name: 'title', type: new arrow.Utf8() }),
     // Q1 (see below): other columns are also user names; can they share a single Dictionary?
     arrow.Field.new({ name: 'userName', type: userNameDict }),
     arrow.Field.new({ name: 'userId', type: new arrow.Uint32() }),
     arrow.Field.new({ name: 'wordCount', type: new arrow.Uint32() }),
     arrow.Field.new({ name: 'url', type: new arrow.Utf8() }),
     arrow.Field.new({
       name: 'tags',
       // List takes a child Field; the child name 'tag' is arbitrary
       type: new arrow.List(arrow.Field.new({
         name: 'tag',
         type: new arrow.Dictionary(new arrow.Utf8(), new arrow.Int32()),
       })),
     }),
     // ...
   ]);

   let count = 0, batches = 0, allchunkbytes = 0;
   const started = new Date();
   // Q2 (see below): DateMillisecond yields all 0's on ISO strings, so pre-parse to epoch millis
   const parseDate = (obj) => { obj.date = Date.parse(obj.date); };

   async function main() {
     await pipeline(
       fs.createReadStream(inputFile, 'utf-8'),
       async function* (source) {
         // ... split each chunk into lines, and yield* the lines
       },
       async function* (source) {
         for await (const line of source) {
           const obj = JSON.parse(line);
           parseDate(obj); obj.appreciationsReceived.forEach(parseDate);
           yield obj;
           ++count;
         }
       },
       arrow.Builder.throughAsyncIterable({
         type: lineObjStruct,
         queueingStrategy: 'bytes', highWaterMark: 1 << 20,
       }),
       async function* (source) {
         for await (const chunk of source) {
           // Q3 (see below): is there a better way to create a RecordBatch than RecordBatch.new?
           const records = arrow.RecordBatch.new(chunk.data.childData, chunk.type.children);
           yield records;
           ++batches;
           allchunkbytes += records.byteLength;
           console.log(new Date(), `yield RecordBatch ${batches} with ${chunk.length} rows ${records.byteLength} bytes (accu ${allchunkbytes})`);
         }
       },
       arrow.RecordBatchStreamWriter.throughNode(),
       fs.createWriteStream(outputFile),
     );

     const end = new Date();
     console.log(end, `written ${count} objs in ${batches} batches, ${allchunkbytes} bytes in ${+end - started}ms`);
   }

   main();
   ```
   
   It runs in 2.3s, converting a 50MB line-delimited JSON file to a 21MB Arrow file. That is not too bad for a single file; most of my dataset's source files are under 100MB, but there are a huge number of them. What is a better (more optimized and ergonomic) way to convert them? Since my app is written in JS and generates those line-delimited JSON files, I am planning to add a post-processing step that generates an additional Arrow file for each one (a sketch of that step follows the log below).
   
   ```console
   2020-09-18T21:49:20.995Z yield RecordBatch 1 with 3872 rows 1522432 bytes (accu 1522432)
   2020-09-18T21:49:21.165Z yield RecordBatch 2 with 3782 rows 1449216 bytes (accu 2971648)
   2020-09-18T21:49:21.290Z yield RecordBatch 3 with 3765 rows 1413312 bytes (accu 4384960)
   2020-09-18T21:49:21.411Z yield RecordBatch 4 with 3784 rows 1453952 bytes (accu 5838912)
   2020-09-18T21:49:21.559Z yield RecordBatch 5 with 3801 rows 1505856 bytes (accu 7344768)
   2020-09-18T21:49:21.728Z yield RecordBatch 6 with 3807 rows 1484672 bytes (accu 8829440)
   2020-09-18T21:49:21.877Z yield RecordBatch 7 with 3806 rows 1513280 bytes (accu 10342720)
   2020-09-18T21:49:22.013Z yield RecordBatch 8 with 3814 rows 1472640 bytes (accu 11815360)
   2020-09-18T21:49:22.151Z yield RecordBatch 9 with 3848 rows 1467712 bytes (accu 13283072)
   2020-09-18T21:49:22.270Z yield RecordBatch 10 with 3827 rows 1442624 bytes (accu 14725696)
   2020-09-18T21:49:22.415Z yield RecordBatch 11 with 3984 rows 1554560 bytes (accu 16280256)
   2020-09-18T21:49:22.560Z yield RecordBatch 12 with 4069 rows 1523200 bytes (accu 17803456)
   2020-09-18T21:49:22.697Z yield RecordBatch 13 with 3986 rows 1398272 bytes (accu 19201728)
   2020-09-18T21:49:22.821Z yield RecordBatch 14 with 3878 rows 1348096 bytes (accu 20549824)
   2020-09-18T21:49:22.938Z yield RecordBatch 15 with 3118 rows 1091200 bytes (accu 21641024)
   2020-09-18T21:49:22.975Z written 57141 objs in 15 batches, 21641024 bytes in 2186ms
   
   real    0m2.334s
   user    0m2.853s
   sys     0m0.225s
   ```
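
   Here is a minimal sketch of that post-processing step, assuming the whole pipeline above is wrapped in a hypothetical `convertFile(inputFile, outputFile)` helper; the helper name and the `.jsonl` extension are placeholders for illustration:

   ```js
   const fs = require('fs');
   const path = require('path');

   // Hypothetical: the pipeline from the script above, wrapped as a function.
   // async function convertFile(inputFile, outputFile) { ... }

   async function convertAll(dir) {
     const files = fs.readdirSync(dir).filter((f) => f.endsWith('.jsonl'));
     for (const name of files) {
       const input = path.join(dir, name);
       const output = input.replace(/\.jsonl$/, '.arrow');
       // one file at a time for now; a few could run in parallel with Promise.all
       await convertFile(input, output);
     }
   }
   ```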
   
   1. Q1: Other columns here are also user names (this dataset is a network of people who appreciate each other's articles by clapping on them, like Medium). Can different columns share a single Dictionary (to save some space)? I tried to create the one `userNameDict` above and use it everywhere (see the sketch after this list for what I tried), but it ends up as a messy Arrow file with wrong names.
   2. Q2: The obj has a `date` field in a JavaScript Date's JSON format, like `"2020-09-18T21:42:30.324Z"`, but `arrow.DateMillisecond` does not recognize that; I got all 0's when I did not pre-parse it. The `parseDate` here is just `(obj) => { obj.date = Date.parse(obj.date); }`. Is there a better Arrow data type for a JavaScript Date's JSON string format, or a way to embed this parser into the data type, like `new arrow.DateMillisecond(Date.parse)`?
   3. Q3: Is there a better way to create a RecordBatch than the static method `arrow.RecordBatch.new`? I expected something like `arrow.RecordBatch.from(chunk)`, or using the `new` operator to call its constructor directly, but had no success.
   4. Q4: As you can see, the builder strategy `{ queueingStrategy: 'bytes', highWaterMark: 1<<20 }` generates 1.4 to 1.5MB per RecordBatch. I have tried 64KB, 256KB, and this 1<<20 (i.e. 1MB), but there seems to be not much difference. Do you have ideas on how to use this better?
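
   For Q1, this is a minimal sketch of exactly what I tried: reusing one Dictionary type instance across two user-name columns. The `senderName` column is a hypothetical second one, matching the commented-out `senderNameDict` above:

   ```js
   const arrow = require('apache-arrow');

   // One Dictionary type instance, shared by every user-name column,
   // hoping the columns would share a single dictionary and save space.
   const userNameDict = new arrow.Dictionary(new arrow.Utf8(), new arrow.Int32());

   const struct = new arrow.Struct([
     arrow.Field.new({ name: 'userName', type: userNameDict }),
     // hypothetical second user-name column (the person who claps):
     arrow.Field.new({ name: 'senderName', type: userNameDict }),
   ]);
   // => writing with this schema ends up as a messy Arrow file with wrong names
   ```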
   
   > It is not an optimized or ergonomic way to interact with Arrow
   
   If JS is not the way to interact with Arrow, then what is the purpose of the JS implementation? Is it supposed to be read-only? What are the other good use cases for it?
   Observable is one I can see, but Observable is not good for files that are too big (not more than 100MB, maybe).
   
   
   

