Re: [Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?

2019-04-26 Thread Wes McKinney
The buffer length recorded in the Buffer struct / metadata does not
necessarily need to be a multiple of 8 bytes, but you must write padding
bytes when emitting the IPC protocol. So if your validity bitmap is 2 bytes
in memory, then you must write at least 6 more bytes of padding on the wire.
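
For example (a minimal Rust sketch; write_padded is an illustrative helper,
not the actual arrow crate API), the 2-byte bitmap case looks like this:

    use std::io::{self, Write};

    // Write a buffer followed by zero bytes so that its on-wire length is a
    // multiple of 8 (the spec recommends 64).
    fn write_padded<W: Write>(out: &mut W, data: &[u8]) -> io::Result<usize> {
        let padded_len = (data.len() + 7) & !7; // round up to a multiple of 8
        out.write_all(data)?;
        out.write_all(&vec![0u8; padded_len - data.len()])?;
        Ok(padded_len)
    }

    // A 2-byte validity bitmap then occupies 8 bytes on the wire:
    // write_padded(&mut out, &[0b1111_0011, 0b0000_0001]) -> Ok(8)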

On Fri, Apr 26, 2019, 3:48 PM Micah Kornfield  wrote:

> Hi Neville,
> Here is my understanding.  Per the spec [1], padding to a multiple of 8
> bytes is required, but 64 bytes is recommended (is "bits" in your e-mail a
> typo?).  The main rationale is to allow SIMD instructions.
>
> For actual record batches, only padding to a multiple of 8 bytes is
> required [2].
>
> Note that slicing of buffers still might require mem-copies/padding.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/format/Layout.html#alignment-and-padding
> [2] https://arrow.apache.org/docs/format/IPC.html
>
> On Fri, Apr 26, 2019 at 1:29 PM Neville Dipale 
> wrote:
>
> > Hi Arrow developers,
> >
> > I'm currently working on IPC in Rust, specifically reading Arrow files.
> > I've noticed that null buffers/bitmaps are always padded to 64 bits (from
> > pyarrow, not sure about others), while in Rust we pad to 8 bits.
> >
> > 1. Is this fine re. Rust per the spec?
> >
> > I'm having issues with reading, but only because I'm comparing whole
> > array data and not just the values and nullness of slots. I see this
> > being more of a problem when writing to files and streams, as we'd need
> > to pad null buffers almost every time (for large arrays IPC could need
> > 2048 bytes while we have 2046, so it's not just a small-data issue).
> >
> > 2. If implementations are allowed to choose either 8 or 64, are the Rust
> > committers happy with us changing to 64-bit padding?
> >
> > The benefit of changing to 64 would be removing the need to re-pad the
> > buffer when writing to streams and files, and it would make us more
> > compatible with other implementations. I suspect this would otherwise
> > come up as an issue when we get to adding Rust to the interop tests.
> >
> > I tried changing to 64-bit before writing this mail, but bit-fu is still
> > beyond my knowledge, so I'd need help from someone else with implementing
> > this, or at least letting me know which lines to change. I don't mind
> > then making sure all tests still pass.
> >
> > My goal is to complete IPC work by 0.14 release, so this would be a bit
> > urgent as I'm stuck right now.
> >
> > Thanks
> > Neville
> >
>


Re: [Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?

2019-04-26 Thread Micah Kornfield
Hi Neville,
Here is my understanding.  Per the spec [1], padding to a multiple of 8
bytes is required, but 64 bytes is recommended (is "bits" in your e-mail a
typo?).  The main rationale is to allow SIMD instructions.
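
For example (a minimal Rust sketch, not an existing arrow crate helper; the
name is illustrative), rounding a bitmap's byte length up to the recommended
64-byte multiple is just:

    // Next multiple of 64 bytes (the recommended padding/alignment);
    // use (len + 7) & !7 for the required 8-byte minimum.
    fn round_up_to_64(len: usize) -> usize {
        (len + 63) & !63
    }

    // e.g. a 1025-slot array needs 129 bitmap bytes, which pads to 192:
    // assert_eq!(round_up_to_64(129), 192);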

For actual record batches, only padding to a multiple of 8 bytes is
required [2].

Note that slicing of buffers still might require mem-copies/padding.

Thanks,
Micah

[1] https://arrow.apache.org/docs/format/Layout.html#alignment-and-padding
[2] https://arrow.apache.org/docs/format/IPC.html

On Fri, Apr 26, 2019 at 1:29 PM Neville Dipale 
wrote:

> Hi Arrow developers,
>
> I'm currently working on IPC in Rust, specifically reading Arrow files.
> I've noticed that null buffers/bitmaps are always padded to 64 bits (from
> pyarrow, not sure about others), while in Rust we pad to 8 bits.
>
> 1. Is this fine re. Rust per the spec?
>
> I'm having issues with reading, but only because I'm comparing whole array
> data and not just the values and nullness of slots. I see this being more
> of a problem when writing to files and streams, as we'd need to pad null
> buffers almost every time (for large arrays IPC could need 2048 bytes while
> we have 2046, so it's not just a small-data issue).
>
> 2. If implementations are allowed to choose either 8 or 64, are the Rust
> committers happy with us changing to 64-bit padding?
>
> The benefit of changing to 64 would be removing the need to re-pad the
> buffer when writing to streams and files, and it would make us more
> compatible with other implementations. I suspect this would otherwise come
> up as an issue when we get to adding Rust to the interop tests.
>
> I tried changing to 64-bit before writing this mail, but bit-fu is still
> beyond my knowledge, so I'd need help from someone else with implementing
> this, or at least letting me know which lines to change. I don't mind then
> making sure all tests still pass.
>
> My goal is to complete IPC work by 0.14 release, so this would be a bit
> urgent as I'm stuck right now.
>
> Thanks
> Neville
>


[Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?

2019-04-26 Thread Neville Dipale
Hi Arrow developers,

I'm currently working on IPC in Rust, specifically reading Arrow files.
I've noticed that null buffers/bitmaps are always padded to 64 bits (from
pyarrow, not sure about others), while in Rust we pad to 8 bits.

1. Is this fine re. Rust per the spec?

I'm having issues with reading, but only because I'm comparing whole array
data and not just the values and nullness of slots. I see this being more of
a problem when writing to files and streams, as we'd need to pad null
buffers almost every time (for large arrays IPC could need 2048 bytes while
we have 2046, so it's not just a small-data issue).
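For instance, if those numbers come from an array of 16,368 slots: that needs
16,368 / 8 = 2,046 bitmap bytes with 8-bit padding, but the next 64-byte
multiple is 2,048, so two extra padding bytes would be needed on write.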

2. If implementations are allowed to choose either 8 or 64, are the Rust
committers happy with us changing to 64-bit padding?

The benefit of changing to 64 would be removing the need to re-pad the
buffer when writing to streams and files, and it would make us more
compatible with other implementations. I suspect this would otherwise come
up as an issue when we get to adding Rust to the interop tests.

I tried changing to 64-bit before writing this mail, but bit-fu is still
beyond my knowledge, so I'd need help from someone else with implementing
this, or at least letting me know which lines to change. I don't mind then
making sure all tests still pass.

My goal is to complete IPC work by 0.14 release, so this would be a bit
urgent as I'm stuck right now.

Thanks
Neville


[jira] [Created] (ARROW-5222) [Python] Issues with installing pyarrow for development on MacOS

2019-04-26 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5222:
--

 Summary: [Python] Issues with installing pyarrow for development 
on MacOS
 Key: ARROW-5222
 URL: https://issues.apache.org/jira/browse/ARROW-5222
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Neal Richardson
 Fix For: 0.14.0


I tried following the 
[instructions|https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst]
 for installing pyarrow for developers on macos, and I ran into quite a bit of 
difficulty. I'm hoping we can improve our documentation and/or tooling to make 
this a smoother process. 

I know we can't anticipate every quirk of everyone's dev environment, but in my 
case I was getting set up on a new machine, so this was from a clean slate. 
I'm also new to contributing to the project, so I'm a "clean slate" in that 
regard too; my ignorance may be exposing other assumptions in the docs.
 # The instructions recommend using conda, but as this [Stack Overflow 
question|https://stackoverflow.com/questions/55798166/cmake-fails-with-when-attempting-to-compile-simple-test-program]
 notes, cmake fails. Uwe helpfully suggested installing an older MacOS SDK from 
[here|https://github.com/phracker/MacOSX-SDKs/releases]. That may work, but I'm 
personally wary of installing binaries from an unofficial GitHub account, let 
alone record that in our docs as an official recommendation. Either way, we 
should update the docs either to note this necessity or to recommend against 
installing with conda on macos.
 # After that, I tried to go down the Homebrew path. Ultimately this did succeed, 
but it was rough. It seemed that I had to `brew install` a lot of packages that 
weren't included in the arrow/python/Brewfile (i.e. try to run cmake, see which 
missing dependency it failed on, `brew install` it, retry `cmake`, and repeat). 
Among the libs I installed this way were double-conversion, snappy, brotli, 
protobuf, gtest, rapidjson, flatbuffers, lz4, zstd, c-ares, and boost. It's not 
clear how many of these extra dependencies were needed only because I'd 
installed just the Xcode command-line tools and not the full Xcode from the App 
Store; regardless, the Brewfile should be complete if we want to use it.
 # In searching Jira for the double-conversion issue (the first one I hit), I 
found [this issue/PR|https://github.com/apache/arrow/pull/4132/files], which 
added double-conversion to a different Brewfile, in c_glib. So I tried 
installing that Brewfile with `brew bundle`. It would probably be good to have a common 
Brewfile for the C++ setup, which the python and glib ones could load and then 
add any other extra dependencies, if necessary. That way, there's one place to 
add common dependencies.
 # I got close here but still had issues with `BOOST_HOME` not being found, 
even though I had brew-installed it. From the console output, it appeared that 
even though I was not using conda and did not have an active conda environment 
(I'd even done `conda env remove --name pyarrow-dev`), the cmake configuration 
script detected that conda existed and decided to use conda to resolve 
dependencies. I tried setting lots of different environment variables to tell 
cmake not to use conda, but ultimately I was only able to get past this by 
deleting conda from my system entirely.
 # This let me get to the point of being able to `import pyarrow`. But then 
running tests failed because the `hypothesis` package was not installed. I see 
that it is included in requirements-test.txt and setup.py under tests_require, 
but I followed the installation instructions and this package did not end up in 
my virtualenv. `pip install hypothesis` resolved it.





Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Thread Brian Bowman
Hello Wes,

Thanks for the info!  I'm working to better understand Parquet/Arrow design and 
development processes.   No hurry for LARGE_BYTE_ARRAY.

-Brian


On 4/26/19, 11:14 AM, "Wes McKinney"  wrote:

hi Brian,

I doubt that such a change could be made on a short time horizon.
Collecting feedback and building consensus (if it is even possible)
with stakeholders would take some time. The appropriate place to have
the discussion is here on the mailing list, though

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman  wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without
> resorting to other alternatives.  Is this something that could be done in
> Parquet over the next few months?  I have a lot of experience with file
> formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney"  wrote:
>
> hi Brian,
>
> Just to comment from the C++ side -- the 64-bit issue is a limitation
> of the Parquet format itself and not related to the C++
> implementation. It would be possibly interesting to add a
> LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
> doing much the same in Apache Arrow for in-memory)
>
> - Wes
>
> On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  
wrote:
> >
> > I don't think that's what you would want to do. Parquet will eventually
> > compress large values, but not until after making defensive copies and
> > attempting to encode them. In the end, it will be a lot more overhead,
> > plus the work to make it possible. I think you'd be much better off
> > compressing before storing in Parquet if you expect good compression rates.
> >
> > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  
wrote:
> >
> > > My hope is that these large ByteArray values will encode/compress to a
> > > fraction of their original size.  FWIW, cpp/src/parquet/
> > > column_writer.cc/.h has int64_t offset and length fields all over the
> > > place.
> > >
> > > External file references to BLOBS are doable but not the elegant,
> > > integrated solution I was hoping for.
> > >
> > > -Brian
> > >
> > > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> > >
> > > Looks like we will need a new encoding for this:
> > > https://github.com/apache/parquet-format/blob/master/Encodings.md
> > >
> > > That doc specifies that the plain encoding uses a 4-byte length. That's
> > > not going to be a quick fix.
> > >
> > > Now that I'm thinking about this a bit more, does it make sense to
> > > support byte arrays that are more than 2GB? That's far larger than the
> > > size of a row group, let alone a page. This would completely break
> > > memory management in the JVM implementation.
> > >
> > > Can you solve this problem using a BLOB type that references an
> > > external file with the gigantic values? Seems to me that values this
> > > large should go in separate files, not in a Parquet file where it would
> > > destroy any benefit from using the format.
> > >
> > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman 
 wrote:
> > >
> > >> Hello Ryan,
> > >>
> > >> Looks like it's limited by both the Parquet implementation and the
> > >> Thrift message methods.  Am I missing anything?
> > >>
> > >> From cpp/src/parquet/types.h
> > >>
> > >> struct ByteArray {
> > >>   ByteArray() : len(0), ptr(NULLPTR) {}
> > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> > >>   uint32_t len;
> > >>   const uint8_t* ptr;
> > >> };
> > >>
> > >> From cpp/src/parquet/thrift.h
> > >>
> > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> > >> deserialized_msg) {
> > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
> > >> out)
> > >>
> > >> -Brian
> > >>
> > >> On 4/5/19, 1:32 PM, "Ryan Blue"  
wrote:
> > >>
> > >> Hi Brian,
> > >>
> > >> This seems like something we should allow. What imposes the current
> > >> limit?
> > >> Is it in the thrift format, or just the implementations?
> > >>
> > >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 

> > >> wrote:
> > >>
> > >> > All,
> > >> >
> > >> 

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Thread Wes McKinney
hi Brian,

I doubt that such a change could be made on a short time horizon.
Collecting feedback and building consensus (if it is even possible)
with stakeholders would take some time. The appropriate place to have
the discussion is here on the mailing list, though

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman  wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without 
> resorting to other alternatives.  Is this something that could be done in 
> Parquet over the next few months?  I have a lot of experience with file 
> formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney"  wrote:
>
> hi Brian,
>
> Just to comment from the C++ side -- the 64-bit issue is a limitation
> of the Parquet format itself and not related to the C++
> implementation. It would be possibly interesting to add a
> LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
> doing much the same in Apache Arrow for in-memory)
>
> - Wes
>
> On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  
> wrote:
> >
> > I don't think that's what you would want to do. Parquet will eventually
> > compress large values, but not until after making defensive copies and
> > attempting to encode them. In the end, it will be a lot more overhead,
> > plus the work to make it possible. I think you'd be much better off
> > compressing before storing in Parquet if you expect good compression rates.
> >
> > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  
> wrote:
> >
> > > My hope is that these large ByteArray values will encode/compress to a
> > > fraction of their original size.  FWIW, cpp/src/parquet/
> > > column_writer.cc/.h has int64_t offset and length fields all over the
> > > place.
> > >
> > > External file references to BLOBS are doable but not the elegant,
> > > integrated solution I was hoping for.
> > >
> > > -Brian
> > >
> > > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> > >
> > > Looks like we will need a new encoding for this:
> > > https://github.com/apache/parquet-format/blob/master/Encodings.md
> > >
> > > That doc specifies that the plain encoding uses a 4-byte length. That's
> > > not going to be a quick fix.
> > >
> > > Now that I'm thinking about this a bit more, does it make sense to
> > > support byte arrays that are more than 2GB? That's far larger than the
> > > size of a row group, let alone a page. This would completely break
> > > memory management in the JVM implementation.
> > >
> > > Can you solve this problem using a BLOB type that references an
> > > external file with the gigantic values? Seems to me that values this
> > > large should go in separate files, not in a Parquet file where it would
> > > destroy any benefit from using the format.
> > >
> > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  
> wrote:
> > >
> > >> Hello Ryan,
> > >>
> > >> Looks like it's limited by both the Parquet implementation and the
> > >> Thrift message methods.  Am I missing anything?
> > >>
> > >> From cpp/src/parquet/types.h
> > >>
> > >> struct ByteArray {
> > >>   ByteArray() : len(0), ptr(NULLPTR) {}
> > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> > >>   uint32_t len;
> > >>   const uint8_t* ptr;
> > >> };
> > >>
> > >> From cpp/src/parquet/thrift.h
> > >>
> > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> > >> deserialized_msg) {
> > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
> > >> out)
> > >>
> > >> -Brian
> > >>
> > >> On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:
> > >>
> > >> Hi Brian,
> > >>
> > >> This seems like something we should allow. What imposes the current
> > >> limit?
> > >> Is it in the thrift format, or just the implementations?
> > >>
> > >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 
> 
> > >> wrote:
> > >>
> > >> > All,
> > >> >
> > >> > SAS requires support for storing varying-length character and
> > >> > binary blobs with a 2^64 max length in Parquet.   Currently, the
> > >> > ByteArray len field is a uint32_t.   Looks like this will require
> > >> > incrementing the Parquet file format version and changing ByteArray
> > >> > len to uint64_t.
> > >> >
> > >> > Have there been any requests for this or other Parquet developments
> > >> > that require file format versioning changes?
> > >> >
> > >> > I realize this is a non-trivial ask.  Thanks for 

Re: [VOTE] Add 64-bit offset list, binary, string (utf8) types to the Arrow columnar format

2019-04-26 Thread Brian Bowman
Can non-Arrow PMC members/committers vote?  

If so, +1 

-Brian

On 4/25/19, 4:34 PM, "Wes McKinney"  wrote:

In a recent mailing list discussion [1] Micah Kornfield has proposed
to add new list and variable-size binary and unicode types to the
Arrow columnar format with 64-bit signed integer offsets, to be used
in addition to the existing 32-bit offset varieties. These will be
implemented as new types in the Type union in Schema.fbs (the
particular names can be debated in the PR that implements them):

LargeList
LargeBinary
LargeString [UTF8]

While very large contiguous columns are not a principal use case for
the columnar format, it has been observed empirically that there are
applications that use the format to represent datasets where
realizations of data can sometimes exceed the 2^31 - 1 "capacity" of a
column and cannot be easily (or at all) split into smaller chunks.
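
As an illustrative sketch only (the final type names and layouts will be
defined in the implementing PR), the difference is just the width of the
offsets that index into the value data:

    // Existing Binary/String/List: 32-bit offsets cap the value data
    // at 2^31 - 1 bytes per array.
    fn value_i32<'a>(data: &'a [u8], offsets: &[i32], i: usize) -> &'a [u8] {
        &data[offsets[i] as usize..offsets[i + 1] as usize]
    }

    // Proposed LargeBinary/LargeString/LargeList: 64-bit signed offsets,
    // same layout, but the value data can exceed 2^31 - 1 bytes.
    fn value_i64<'a>(data: &'a [u8], offsets: &[i64], i: usize) -> &'a [u8] {
        &data[offsets[i] as usize..offsets[i + 1] as usize]
    }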

Please vote whether to accept the changes. The vote will be open for at
least 72 hours.

[ ] +1 Accept the additions to the columnar format
[ ] +0
[ ] -1 Do not accept the changes because...

Thanks,
Wes

[1]: 
https://lists.apache.org/thread.html/8088eca21b53906315e2bbc35eb2d246acf10025b5457eccc7a0e8a3@%3Cdev.arrow.apache.org%3E




[jira] [Created] (ARROW-5221) Improve the performance of class SegmentsUtil

2019-04-26 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5221:
---

 Summary: Improve the performance of class SegmentsUtil
 Key: ARROW-5221
 URL: https://issues.apache.org/jira/browse/ARROW-5221
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Liya Fan
Assignee: Liya Fan


Improve the performance of class SegmentsUtil in two ways:
 # In method allocateReuseBytes, the generated byte array should be cached for 
reuse if its size does not exceed MAX_BYTES_LENGTH. However, the array is not 
cached when bytes.length < length, which leads to avoidable allocations:

 

if (bytes == null) {
  if (length <= MAX_BYTES_LENGTH) {
    bytes = new byte[MAX_BYTES_LENGTH];
    BYTES_LOCAL.set(bytes);
  } else {
    bytes = new byte[length];
  }
} else if (bytes.length < length) {
  bytes = new byte[length];
}

 

2. To compute the byte offset, an integer is bitwise-ANDed with a mask that 
clears the low bits and is then shifted right. The AND is redundant, because 
the shift already discards those low bits:

 

((index & BIT_BYTE_POSITION_MASK) >>> 3)

 





[jira] [Created] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-04-26 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5220:


 Summary: [Python] index / unknown columns in specified schema in 
Table.from_pandas
 Key: ARROW-5220
 URL: https://issues.apache.org/jira/browse/ARROW-5220
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The {{Table.from_pandas}} method allows you to specify a schema ("This can be used 
to indicate the type of columns if we cannot infer it automatically.").

But, if you also want to specify the type of the index, you get an error:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
                       ('a', pa.int64()),
                       ('b', pa.float64())])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: unknown columns in general also give this error (columns specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11, this did not give an error (e.g. I noticed this from 
the example code in ARROW-3861). So before, unknown columns in the specified 
schema were ignored, while now they raise an error. Was this a conscious 
change? (Before, specifying the index in the schema also "worked" in the sense 
that it didn't raise an error, but it was ignored, so it didn't actually do 
what you would expect.)

Questions:

- I think we should support specifying the index in the passed {{schema}}, so 
that the example above works (although this might be complicated with a 
RangeIndex, which is not serialized any more).
- But what should we do in general with additional columns in the schema that 
are not in the DataFrame? Are we fine with keeping the current behaviour of 
raising an error (the error message could be improved then)? Or do we want to 
ignore them again? (Or it could actually add them to the table as all-null 
columns.)





Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-26 Thread Antoine Pitrou


It's "arbitrary" from Arrow's point of view, because Arrow itself cannot
represent this data (except as a binary blob).  Though, as Micah said,
this may change at some point.

Instead of extending Arrow to fit this use case, perhaps it would be
better to write a separate library that sits atop Arrow for your purposes?

Regards

Antoine.


On 26/04/2019 at 04:20, Shawn Yang wrote:
> Hi Antoine,
> It's not an arbitrary data type; the types are similar to the data types in
> https://spark.apache.org/docs/latest/sql-reference.html#data-types and
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/table/sql.html#data-types.
> Our framework is similar to Flink streaming, but is written in
> C++/Java/Python. Data needs to be transferred from a Java process to a
> Python process over TCP, or over shared memory if they are on the same
> machine. For example, one case is online learning: the features are
> generated in Java streaming, and the training data is then transferred to a
> Python TensorFlow worker for training. In a system such as Flink, data is
> processed row by row, not in columnar form, so a serialization framework is
> needed to serialize data row by row in a language-independent way for
> C++/Java/Python.
> 
> Regards