Re: Arrow 15 parquet nanosecond change

2024-02-21 Thread Jacek Pliszka
Hi!

pq.write_table(
    table, config.output_filename, coerce_timestamps="us",
    allow_truncated_timestamps=True,
)

allows you to write the timestamps as microseconds (us) instead of nanoseconds (ns).
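For context, a minimal end-to-end sketch (the data and the output file name here are made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01 00:00:00.123456789"])})
table = pa.Table.from_pandas(df)      # "ts" comes in as timestamp[ns]

pq.write_table(
    table,
    "out.parquet",
    coerce_timestamps="us",           # store microseconds instead of nanoseconds
    allow_truncated_timestamps=True,  # silently drop the sub-microsecond part
)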

BR

J


Wed, 21 Feb 2024 at 21:44 Li Jin wrote:

> Hi,
>
> My colleague has informed me that during the Arrow 12->15 upgrade, he found
> that writing a pandas DataFrame with datetime64[ns] to Parquet will result
> in nanosecond metadata and nanosecond values.
>
> I wonder if this is configurable back to the old behavior so we can
> enable “nanosecond in Parquet” gradually? There is code that reads Parquet
> files and doesn’t handle Parquet nanoseconds yet.
>
> Thanks!
> Li
>


Re: Is there anyway to resize record batches

2023-11-22 Thread Jacek Pliszka
Re 4: you can create a ChunkedArray from Arrays.
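A minimal pyarrow sketch (the arrays here are made up):

import pyarrow as pa

a1 = pa.array([1, 2, 3])
a2 = pa.array([4, 5, 6])
chunked = pa.chunked_array([a1, a2])   # one logical column made of two chunks
single = chunked.combine_chunks()      # flatten into a single contiguous chunk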

BR

J

Wed, 22 Nov 2023 at 20:48 Aldrin wrote:

> Assuming the C++ implementation, Jacek's suggestion (#3 below) is probably
> best. Here is some extra context:
>
> 1. You can slice larger RecordBatches [1]
> 2. You can make a larger RecordBatch [2] from columns of smaller
> RecordBatches [3] probably using the correct type of Builder [4] and with a
> bit of resistance from the various types
> 3. As Jacek said, you can wrap smaller RecordBatches together as a Table
> [5], combine the chunks [6], and then convert back to RecordBatches using a
> TableBatchReader [7] if necessary
> 4. I didn't see anything useful in the Compute API for concatenating
> arbitrary Arrays or RecordBatches, but you can use Selection functions [8]
> instead of Slicing for anything that's too big.
>
>
> [1]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch5SliceE7int64_t7int64_t
>
> [2]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow11RecordBatch4MakeENSt10shared_ptrI6SchemaEE7int64_tNSt6vectorINSt10shared_ptrI5Array
> [3]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow11RecordBatch6columnEi
> [4]: https://arrow.apache.org/docs/cpp/arrays.html#building-an-array
>
> [5]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow5Table17FromRecordBatchesENSt10shared_ptrI6SchemaEERKNSt6vectorINSt10shared_ptrI11RecordBatch
> [6]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow5Table13CombineChunksEP10MemoryPool
> [7]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow16TableBatchReaderE
>
> [8]: https://arrow.apache.org/docs/cpp/compute.html#selections
>
>
>
> # --
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka <
> jacek.plis...@gmail.com> wrote:
>
>
> > Hi!
> >
>
> > I think some code is needed for clarity. You can concatenate tables (and
> > combine_chunks afterwards) or arrays. Then pass such concatenated one.
> >
>
> > Regards,
> >
>
> > Jacek
> >
>
> > Wed, 22 Nov 2023 at 19:54 Lee, David (PAG) david@blackrock.com.invalid wrote:
> >
> > > I've got 36 million rows of data which ends up as 3,000 record batches
> > > ranging from 12k to 300k rows each. I'm assuming these batches are
> > > created using the multithreaded CSV file reader.
> > >
> > > Is there any way to reorganize the data into something like 36 batches
> > > consisting of 1 million rows each?
> > >
> > > What I'm seeing when we try to load this data using the ADBC Snowflake
> > > driver is that each individual batch is executed as a bind array insert in
> > > the Snowflake Go driver.
> > > 3,000 bind array inserts is taking 3 hours.


Re: Is there anyway to resize record batches

2023-11-22 Thread Jacek Pliszka
Hi!

I think some code is needed for clarity. You can concatenate tables (and
combine_chunks afterwards) or arrays, then pass the concatenated result.
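For example, a sketch with the pyarrow Python API (the same idea applies in
C++): wrap the small batches in a Table, compact it, and re-slice it into
batches of roughly 1 million rows.

import pyarrow as pa

def rebatch(batches, rows_per_batch=1_000_000):
    table = pa.Table.from_batches(batches)   # wraps the existing batches, no copy
    table = table.combine_chunks()           # concatenate the chunks of each column
    return table.to_batches(max_chunksize=rows_per_batch)

Note that combine_chunks does copy the data; the initial from_batches and the
final re-slicing are zero-copy.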

Regards,

Jacek

Wed, 22 Nov 2023 at 19:54 Lee, David (PAG) wrote:

> I've got 36 million rows of data which ends up as 3,000 record batches
> ranging from 12k to 300k rows each. I'm assuming these batches are
> created using the multithreaded CSV file reader.
>
> Is there any way to reorganize the data into something like 36 batches
> consisting of 1 million rows each?
>
> What I'm seeing when we try to load this data using the ADBC Snowflake
> driver is that each individual batch is executed as a bind array insert in
> the Snowflake Go driver.
> 3,000 bind array inserts is taking 3 hours.


Re: Apache Arrow file format

2023-10-19 Thread Jacek Pliszka
There is a note there explaining what they mean by it, but further down
they do not make that distinction.

The fact that Parquet can be a better in-memory format than Arrow for
certain common uses is something I hadn't thought of, and it is
eye-opening for me, admittedly because I am not up to date on the topic.

On the other hand, it shows which Parquet features could be useful for
Arrow to match that performance.

BR,

Jacek

Thu, 19 Oct 2023 at 00:17 Antoine Pitrou wrote:
>
>
> The fact that they describe Arrow and Feather as distinct formats
> (they're not!) with different characteristics is a bit of a bummer.
>
>
> On 18/10/2023 at 22:20, Andrew Lamb wrote:
> > If you are looking for a more formal discussion and empirical analysis of
> > the differences, I suggest reading "A Deep Dive into Common Open Formats
> > for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!) that
> > compares and contrasts the Arrow, Parquet, ORC and Feather file formats.
> >
> > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> >
> > On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> >  wrote:
> >
> >> To further what others have already mentioned, the IPC file format is
> >> primarily optimised for IPC use-cases, that is exchanging the entire
> >> contents between processes. It is relatively inexpensive to encode and
> >> decode, and supports all arrow datatypes, making it ideal for things
> >> like spill-to-disk processing, distributed shuffles, etc...
> >>
> >> Parquet by comparison is a storage format, optimised for space
> >> efficiency and selective querying, with [1] containing an overview of
> >> the various techniques the format affords. It is comparatively expensive
> >> to encode and decode, and instead relies on index structures and
> >> statistics to accelerate access.
> >>
> >> Both are therefore perfectly viable options depending on your particular
> >> use-case.
> >>
> >> [1]:
> >>
> >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >>
> >> On 18/10/2023 13:59, Dewey Dunnington wrote:
> >>> Plenty of opinions here already, but I happen to think that IPC
> >>> streams and/or Arrow File/Feather are wildly underutilized. For the
> >>> use-case where you're mostly just going to read an entire file into R
> >>> or Python it's a bit faster (and far superior to a CSV or pickling or
> >>> .rds files in R).
> >>>
>  you're going to read all the columns for a record batch in the file, no
> >> matter what
> >>> The metadata for every column in every record batch has to be
> >>> read, but there's nothing inherent about the format that prevents
> >>> selectively loading into memory only the required buffers. (I don't
> >>> know off the top of my head if any reader implementation actually does
> >>> this).
> >>>
> >>> On Wed, Oct 18, 2023 at 12:02 AM wish maple 
> >> wrote:
>  Arrow IPC file is great; it focuses on in-memory representation and direct
>  computation. Basically, it supports compression and dictionary encoding, and
>  can zero-copy deserialize the file into the in-memory Arrow format.
>
>  Parquet provides some strong functionality, like statistics, which can help
>  prune unnecessary data during scanning and avoid CPU and IO cost. And it has
>  highly efficient encoding, which can make a Parquet file smaller than the
>  Arrow IPC file for the same data. However, some Arrow data types cannot be
>  converted to the corresponding Parquet types in the current arrow-cpp
>  implementation. You can go to the Arrow documentation to take a look.
> 
>  Adam Lippai wrote on Wed, Oct 18, 2023 at 10:50:
> 
> > Also there is
> > https://github.com/lancedb/lance between the two formats. Depending
> >> on the
> > use case it can be a great choice.
> >
> > Best regards
> > Adam Lippai
> >
> > On Tue, Oct 17, 2023 at 22:44 Matt Topol 
> >> wrote:
> >
> >> One benefit of the Feather format (i.e. the Arrow IPC file format) is the
> >> ability to mmap the file to easily handle reading sections of a
> >> larger-than-memory file of data. Since, as Felipe mentioned, the format is
> >> focused on in-memory representation, you can easily and simply mmap the
> >> file and use the raw bytes directly. For a large file that you only want
> >> to read sections of, this can be beneficial for IO and memory usage.
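For illustration, a minimal Python sketch of the mmap route; the file name is
made up and the file is assumed to have been written with the IPC file writer:

import pyarrow as pa

with pa.memory_map("data.arrow", "r") as source:
    reader = pa.ipc.open_file(source)
    print(reader.num_record_batches)
    batch = reader.get_batch(0)   # buffers reference the mapping, not a copy

Only the pages actually touched by the batches you read get paged in.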
> >>
> >> Unfortunately, you are correct that it doesn't allow for easy column
> >> projection (you're going to read all the columns for a record batch in the
> >> file, no matter what). So it's going to be a trade-off based on your needs
> >> as to whether it makes sense, or if you should use a file format like
> >> Parquet instead.
> >>
> >> -Matt
> >>
> >>
> >> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> 

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-05-03 Thread Jacek Pliszka
Good idea, I've created a JIRA issue:

https://issues.apache.org/jira/browse/ARROW-12637

I named it "range" to avoid confusion with intervals...
though some confusion will remain, as it is called "interval" in pandas and in
logic (Allen's interval algebra).
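As a rough illustration of the struct-based extension type Micah describes
below (all names here are made up; this is a sketch, not an official Arrow
type):

import pyarrow as pa

class RangeType(pa.ExtensionType):
    """Hypothetical 'range': Struct[start: T, end: T] storage, with the
    open/closed flag kept in the extension metadata."""

    def __init__(self, value_type, closed="both"):
        self.closed = closed
        storage = pa.struct([("start", value_type), ("end", value_type)])
        super().__init__(storage, "example.range")

    def __arrow_ext_serialize__(self):
        return self.closed.encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(storage_type.field("start").type, serialized.decode())

pa.register_extension_type(RangeType(pa.timestamp("ns")))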

BR,

Jacek

Mon, 3 May 2021 at 18:05 Micah Kornfield wrote:
>
> Hi Jacek,
> This seems like reasonable functionality.  I think the problem comes in
> two parts:
> 1.  This might be a good candidate for a "Well Known"/Officially supported
> Extension type. I can think of a few different representations but I would
> guess something like Struct[start: T, end: T] with well defined
> extension metadata to define open/closed on start and end might be the best
> (we should probably spin this off into a separate discussion thread).
> 2.  Adding the right computation Kernels to work with the type.
>
> Do you want to start a new thread or open up some JIRAs to track this work?
>
> Thanks,
> Micah
>
> On Mon, May 3, 2021 at 5:32 AM Jacek Pliszka 
> wrote:
>
> > Sorry, my mistake.
> >
> > You are right - I meant anchored intervals as in pandas - ones with
> > defined start and end - and I think many future users will make the
> > same mistake.
> >
> > I would love to be able to do fast overlap joins on arrow level.
> >
> > Best Regards,
> >
> > Jacek
> >
> >
> >
> >
> > Sun, 2 May 2021 at 23:06 Wes McKinney wrote:
> > >
> > > I also don't understand the comment about closed / open / semi-open
> > > intervals. Perhaps there is a confusion, since "interval" as we mean
> > > it here is called a "time delta" in some other projects. An interval
> > > here does not refer to a time span with a distinct start and end point
> > > (I understand this might be confusing to a pandas user since pandas
> > > has an interval data type where each value is a tuple of arbitrary
> > > start/end).
> > >
> > > On Sun, May 2, 2021 at 3:46 PM Micah Kornfield 
> > wrote:
> > > >
> > > > Hi Jacek,
> > > > I'm not sure I fully understand the proposal, could you elaborate with
> > more
> > > > examples/details?  For instance DAY_TIME isn't just a UINT64, it
> > > > actually contains 2 separate fields (days and milliseconds).
> > > >
> > > > In terms of closed vs half-open, in my limited understanding, that is
> > more
> > > > a concern of functions using interval types rather than the type
> > itself.
> > > > For instance a quick search of postgres [1] docs only talks about
> > half-open
> > > > in relation to the "Overlaps" operator
> > > >
> > > > Thanks,
> > > > -Micah
> > > >
> > > > [1] https://www.postgresql.org/docs/9.1/functions-datetime.html
> > > >
> > > >
> > > >
> > > > On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka  > >
> > > > wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > I wonder if it were possible to have generic interval with integers
> > of
> > > > > specified size just to have common base for interval arithmetic.
> > > > >
> > > > > Then user can convert their period to ordinals and use the arithmetic
> > > > > (joining, deoverlapping, common parts, explosion etc.).
> > > > >
> > > > > So YEAR_MONTH and DAY_TIME would be just special cases of
> > > > > INTERVAL_UINT32 and INTERVAL_UINT64
> > > > >
> > > > > Also I believe it is worth to state whether there are only closed
> > > > > intervals or open/semi-open ones are allowed as well.
> > > > >
> > > > > I believe I am just one of many reinventing the wheel here and
> > writing
> > > > > own versions of the above.
> > > > >
> > > > > BR,
> > > > >
> > > > > Jacek
> > > > >
> > > > >
> > > > > Fri, 2 Apr 2021 at 21:53 Micah Kornfield wrote:
> > > > > >
> > > > > > Andrew is the use-case you have simply postgres compatibility or
> > is it
> > > > > more
> > > > > > extensive?
> > > > > >
> > > > > > One potential problem with combining Month and Day fields, is that
> > the
> > > > > type
> > > > > &

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-05-03 Thread Jacek Pliszka
Sorry, my mistake.

You are right - I meant anchored intervals as in pandas - ones with
defined start and end - and I think many future users will make the
same mistake.

I would love to be able to do fast overlap joins at the Arrow level.

Best Regards,

Jacek




Sun, 2 May 2021 at 23:06 Wes McKinney wrote:
>
> I also don't understand the comment about closed / open / semi-open
> intervals. Perhaps there is a confusion, since "interval" as we mean
> it here is called a "time delta" in some other projects. An interval
> here does not refer to a time span with a distinct start and end point
> (I understand this might be confusing to a pandas user since pandas
> has an interval data type where each value is a tuple of arbitrary
> start/end).
>
> On Sun, May 2, 2021 at 3:46 PM Micah Kornfield  wrote:
> >
> > Hi Jacek,
> > I'm not sure I fully understand the proposal, could you elaborate with more
> > examples/details?  For instance DAY_TIME isn't just a UINT64, it actually
> > contains 2 separate fields (days and milliseconds).
> >
> > In terms of closed vs half-open, in my limited understanding, that is more
> > a concern of functions using interval types rather than the type itself.
> > For instance a quick search of postgres [1] docs only talks about half-open
> > in relation to the "Overlaps" operator
> >
> > Thanks,
> > -Micah
> >
> > [1] https://www.postgresql.org/docs/9.1/functions-datetime.html
> >
> >
> >
> > On Sun, May 2, 2021 at 12:25 AM Jacek Pliszka 
> > wrote:
> >
> > > Hi!
> > >
> > > I wonder if it were possible to have generic interval with integers of
> > > specified size just to have common base for interval arithmetic.
> > >
> > > Then user can convert their period to ordinals and use the arithmetic
> > > (joining, deoverlapping, common parts, explosion etc.).
> > >
> > > So YEAR_MONTH and DAY_TIME would be just special cases of
> > > INTERVAL_UINT32 and INTERVAL_UINT64
> > >
> > > Also I believe it is worth to state whether there are only closed
> > > intervals or open/semi-open ones are allowed as well.
> > >
> > > I believe I am just one of many reinventing the wheel here and writing
> > > own versions of the above.
> > >
> > > BR,
> > >
> > > Jacek
> > >
> > >
> > > Fri, 2 Apr 2021 at 21:53 Micah Kornfield wrote:
> > > >
> > > > Andrew is the use-case you have simply postgres compatibility or is it
> > > more
> > > > extensive?
> > > >
> > > > One potential problem with combining Month and Day fields, is that the
> > > type
> > > > no longer has a defined sort order (the existing Day-Millisecond type
> > > > without assumptions, in particular because I don't think today there is
> > > an
> > > > explicit constraint on the bounds for the millisecond component).
> > > >
> > > > -Micah
> > > >
> > > >
> > > >
> > > > On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > > >
> > > > > On 31/03/2021 at 17:55, Micah Kornfield wrote:
> > > > > > Thanks for the feedback.  A couple of points here and some responses
> > > > > below.
> > > > > >
> > > > > > * One other question is whether the Nanoseconds should actually be
> > > > > > configurable (i.e. use milliseconds or microseconds).  I would lean
> > > > > towards
> > > > > > no.
> > > > >
> > > > > Same for me.
> > > > >
> > > > > > * I'm also still not 100% convinced we need this as a first class
> > > type in
> > > > > > arrow or if we should be looking more closely at the Struct (in the
> > > Arrow
> > > > > > sense) based implementation.  In the future where alternative
> > > encodings
> > > > > are
> > > > > > supported, this could allow for much smaller footprints for this
> > > type.
> > > > >
> > > > > Having a "packed" first class type allows for better locality when
> > > > > accessing data.  It doesn't sound very likely that you'd access only
> > > one
> > > > > component of the interval.
> > > > >
> > > > > But I have no idea how important this is, and temporal d

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-05-02 Thread Jacek Pliszka
Hi!

I wonder if it would be possible to have a generic interval with integers of
a specified size, just to have a common base for interval arithmetic.

Then users can convert their periods to ordinals and use the arithmetic
(joining, de-overlapping, common parts, explosion etc.), as in the sketch below.
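A minimal sketch of the ordinal idea with today's pyarrow (data and names are
made up): anchored intervals as plain integer columns, overlap tested with
ordinary comparisons.

import pyarrow as pa
import pyarrow.compute as pc

iv = pa.table({"start": [0, 10, 25], "end": [5, 20, 30]})
q_start, q_end = 3, 12
# half-open overlap test: start < q_end and end > q_start
mask = pc.and_(pc.less(iv["start"], q_end), pc.greater(iv["end"], q_start))
overlapping = iv.filter(mask)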

So YEAR_MONTH and DAY_TIME would be just special cases of
INTERVAL_UINT32 and INTERVAL_UINT64.

Also I believe it is worth stating whether only closed intervals are
allowed, or open/semi-open ones as well.

I believe I am just one of many reinventing the wheel here and writing
their own versions of the above.

BR,

Jacek


Fri, 2 Apr 2021 at 21:53 Micah Kornfield wrote:
>
> Andrew is the use-case you have simply postgres compatibility or is it more
> extensive?
>
> One potential problem with combining Month and Day fields, is that the type
> no longer has a defined sort order (the existing Day-Millisecond type
> without assumptions, in particular because I don't think today there is an
> explicit constraint on the bounds for the millisecond component).
>
> -Micah
>
>
>
> On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou  wrote:
>
> >
> > On 31/03/2021 at 17:55, Micah Kornfield wrote:
> > > Thanks for the feedback.  A couple of points here and some responses
> > below.
> > >
> > > * One other question is whether the Nanoseconds should actually be
> > > configurable (i.e. use milliseconds or microseconds).  I would lean
> > towards
> > > no.
> >
> > Same for me.
> >
> > > * I'm also still not 100% convinced we need this as a first class type in
> > > arrow or if we should be looking more closely at the Struct (in the Arrow
> > > sense) based implementation.  In the future where alternative encodings
> > are
> > > supported, this could allow for much smaller footprints for this type.
> >
> > Having a "packed" first class type allows for better locality when
> > accessing data.  It doesn't sound very likely that you'd access only one
> > component of the interval.
> >
> > But I have no idea how important this is, and temporal datetypes are
> > generally cumbersome to add support for (conversions, arithmetic, etc.),
> > so it would be nice to avoid adding too many of them :-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > >
> > > The 3
> > >> field implementation doesn't seem to have any way to represent integral
> > >> days, so I am also not sure about that one.
> > >
> > >
> > > Sorry this was an email gaffe.  I intended Month (32 bit int), Day (32
> > bit
> > > int), Nanosecond (64 bit int).
> > >
> > > OTOH I don't really understand the point of supporting "the most
> > >> reasonable ranges for Year, Month and Nanoseconds independently".  What
> > >> does it bring to encode more than one month in the nanoseconds field?
> > >
> > >
> > > I'm happy with simplicity.   In the past there has been some reference to
> > > people wanting to store very large timestamps (fall out of Nanoseconds
> > max
> > > representable value) but we've concluded that this wasn't something that
> > we
> > > wanted to really support.
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Mar 31, 2021 at 4:49 AM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >> I would favour the following characteristics :
> > >> - support for nanoseconds (especially as other Arrow temporal types
> > >> support it)
> > >> - easy to handle (which excludes the ZetaSQL representtaion IMHO)
> > >>
> > >> OTOH I don't really understand the point of supporting "the most
> > >> reasonable ranges for Year, Month and Nanoseconds independently".  What
> > >> does it bring to encode more than one month in the nanoseconds field?
> > >> You can already use the Duration type for that.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >> On 31/03/2021 at 05:48, Micah Kornfield wrote:
> > >>> To follow-up on this conversation I did some analysis on interval
> > types:
> > >>>
> > >>>
> > >>
> > https://docs.google.com/document/d/1i1E_fdQ_xODZcAhsV11Pfq27O50k679OYHXFJpm9NS0/edit
> > >> Please feel free to add more details/systems I missed.
> > >>>
> > >>> Given the disparate requirements of different systems I think the
> > >>> following might make sense for official types (if there isn't consensus, I
> > >>> might try to contribute extension Array implementations for them to
> > >>> Java and C++/Python separately).
> > >>>
> > >>> 1.  3 fields: Year (32 bit), Month (32 bit), Nanoseconds (64 bit) all
> > >> signed.
> > >>> 2.  Postgres representation (Downside is it doesn't support
> > Nanoseconds,
> > >> only microseconds).
> > >>> 3.  ZetaSQL implementation (Requires some bit manipulation) but
> > supports
> > >> the most reasonable ranges for Year, Month and Nanoseconds
> > independently.
> > >>>
> > >>> Thoughts?
> > >>>
> > >>> Micah
> > >>>
> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote:
> > >
> > > I didn’t find any page/documentation on how to do RFC in Arrow
> > >> protocol,
> > > so can anyone point me to it or PR with email will be enough?
> > 
> > 

Re: Decimal128 scale limits

2020-07-01 Thread Jacek Pliszka
Hi!

I am aware of at least 2 different decimal128 things:

a) the one we have, where we use 128 bits to store an integer which is
later shifted by the scale; 38 is the number of digits of the significand,
i.e. the number of decimal digits that fit in 128 bits
(2**128 > 10**38). IMHO it is completely unrelated to the scale, which we
store separately.

b) the IEEE 754 one, which has an exponent from -6143 to +6144.
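To illustrate case (a) with pyarrow (the value is made up):

import pyarrow as pa
from decimal import Decimal

arr = pa.array([Decimal("123.4567")], type=pa.decimal128(38, 4))
# the stored 128-bit integer is 1234567; the scale of 4 shifts it to 123.4567
print(arr.type.precision, arr.type.scale)   # 38 4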

BR,

Jacek

Wed, 1 Jul 2020 at 16:16 Antoine Pitrou wrote:
>
>
> Hello,
>
> Are there limits to the value of the scale for either decimal128 or
> decimal?  Can it be negative?  Can it be greater than 38 (and/or lower
> than -38)?
>
> It's not clear from looking either at the spec or at the C++ code...
>
> Regards
>
> Antoine.


Re: [Python] black vs. autopep8

2020-04-02 Thread Jacek Pliszka
Hi!

I believe the amount of changes is not that important.

In my opinion, what matters is which format will allow reviewers to be
more efficient.

The committer can always reformat as they like. It is harder for the reviewer.

BR,

Jacek

Thu, 2 Apr 2020 at 15:32 Antoine Pitrou wrote:
>
>
> PS: in both cases, Cython files are not processed.  autopep8 is actually
> able to process them, but the comparison wouldn't be apples-to-apples.
>
> (that said, autopep8 gives suboptimal results on Cython files, for
> example it changes "_variable" to "& c_variable" and
> "void* ptr" to "void * ptr")
>
> Regards
>
> Antoine.
>
> On 02/04/2020 at 15:30, Antoine Pitrou wrote:
> >
> > Hello,
> >
> > I've put up two PRs to compare the effect of running black vs. autopep8
> > on the Python codebase.
> >
> > * black: https://github.com/apache/arrow/pull/6810
> >  65 files changed, 7855 insertions(+), 5215 deletions(-)
> >
> > * autopep8: https://github.com/apache/arrow/pull/6811
> >  20 files changed, 137 insertions(+), 118 deletions(-)
> >
> > I've configured black to try and minimize changes (for example, avoid
> > normalizing string quoting style).  Still, the number of changes is
> > humongous and they add 2600 lines to the codebase (which is a tangible
> > amount of vertical space).
> >
> > Regards
> >
> > Antoine.
> >


Re: Using GitHub Actions to automate style and other fixes

2020-02-20 Thread Jacek Pliszka
As a beginner contributor, I believe I can vote for linting as part of the build.

For me the best would be a BEGINNER/ALL_CHECKS option in the Makefile
that runs all the linting and all the checks done in the build.

And the instructions would clearly suggest using it.

BR,

Jacek


Wed, 19 Feb 2020 at 20:40 Antoine Pitrou wrote:
>
>
> Hi,
>
> On Wed, 19 Feb 2020 09:59:04 -0800
> >
> > It doesn't have to be this way. With GitHub Actions, we can run workflows
> > that fix style and other violations and push the fix in a commit back to
> > the branch.
>
> I'm rather opposed to this.  Doing automated pushes behind the user's
> back will feel confusing and slightly obnoxious.
>
> > Style guides and linting are important for large projects like Arrow, but
> > we don't want to add unnecessary friction to the dev process, particularly
> > for new contributors--it's challenging enough without it.
>
> Well, at worst, we can push fixes ourselves before merging a PR if the
> only remaining ones are style fixes.
>
> > What are your thoughts? If anyone objects to using GitHub Actions in this
> > way, would you be satisfied with blacklisting your fork (i.e. you don't
> > want it running on your branches but you don't mind if others do)?
>
> Egoistically, that would satisfy me, but I'm not sure it would be less
> confusing to beginner contributors.
>
> Regards
>
> Antoine.
>
>


Re: [ARROW-3329] Re: Decimal casting or scaling

2020-02-12 Thread Jacek Pliszka
Actually these options still make some sense - but not as much as before.

The use case: unit conversion.

Data about prices is exported from SQL as Decimal(38,10), which uses 128
bits, but the numbers are actually prices which, expressed in cents, fit
perfectly in uint32.

Having scaling would reduce bandwidth/disk usage by a factor of 4.

What would be the best approach to such a use case?

Would a decimal_scale CastOption be OK, or should it rather be a compute
'multiply' kernel?
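For what it's worth, a sketch of the multiply-then-cast route with a recent
pyarrow, assuming decimal arithmetic and decimal-to-integer casts are
available (the values are made up, and a smaller precision than Decimal(38,10)
is used so the multiplication result stays within decimal128):

import pyarrow as pa
import pyarrow.compute as pc
from decimal import Decimal

prices = pa.array([Decimal("12.3400000000")], type=pa.decimal128(28, 10))
hundred = pa.scalar(Decimal("100"), type=pa.decimal128(3, 0))
cents = pc.multiply(prices, hundred).cast(pa.int32())   # -> [1234], fits in 32 bits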

BR,

Jacek


Wed, 12 Feb 2020 at 19:32 Jacek Pliszka wrote:
>
> OK, then what I proposed does not make sense and I can just copy the
> solution you pointed out.
>
> Thank you,
>
> Jacek
>
> Wed, 12 Feb 2020 at 19:27 Wes McKinney wrote:
> >
> > On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka  
> > wrote:
> > >
> > > Hi!
> > >
> > > ARROW-3329 - we can discuss there.
> > >
> > > > It seems like it makes sense to implement both lossless safe casts
> > > > (when all zeros after the decimal point) and lossy casts (fractional
> > > > part discarded) from decimal to integer, do I have that right?
> > >
> > > Yes, though if I understood your examples are the same case - in both
> > > cases fractional part is discarded - just it is all 0s in the first
> > > case.
> > >
> > > The key question is whether CastFunctor in cast.cc has access to scale
> > > of the decimal? If yes how?
> >
> > Yes, it's in the type of the input array. Here's a kernel
> > implementation that uses the TimestampType metadata of the input
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
> >
> > >
> > > If not - these are the options I've came up with:
> > >
> > > Let's assume Decimal128Type value is  n
> > >
> > > Then I expect that base call
> > > .cast('int64') will return  overflow for n beyond int64 values, value 
> > > otherwise
> > >
> > > Option 1:
> > >
> > > .cast('int64', decimal_scale=s)  would calculate  n/10**s and return
> > > overflow if it is beyond int64, value otherwise
> > >
> > > Option 2:
> > >
> > > .cast('int64', bytes_group=0) would return n & 0x
> > > .cast('int64', bytes_group=1) would return (n >> 64) & 0x
> > > .cast('int64') would have default value bytes_group=0
> > >
> > > Option 3:
> > >
> > > cast has no CastOptions but we add  multiply compute kernel and have
> > > something like this instead:
> > >
> > > .compute('multiply', 10**-s).cast('int64')
> > >
> > > BR,
> > >
> > > Jacek


Re: [ARROW-3329] Re: Decimal casting or scaling

2020-02-12 Thread Jacek Pliszka
OK, then what I proposed does not make sense and I can just copy the
solution you pointed out.

Thank you,

Jacek

Wed, 12 Feb 2020 at 19:27 Wes McKinney wrote:
>
> On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka  
> wrote:
> >
> > Hi!
> >
> > ARROW-3329 - we can discuss there.
> >
> > > It seems like it makes sense to implement both lossless safe casts
> > > (when all zeros after the decimal point) and lossy casts (fractional
> > > part discarded) from decimal to integer, do I have that right?
> >
> > Yes, though if I understood your examples are the same case - in both
> > cases fractional part is discarded - just it is all 0s in the first
> > case.
> >
> > The key question is whether CastFunctor in cast.cc has access to scale
> > of the decimal? If yes how?
>
> Yes, it's in the type of the input array. Here's a kernel
> implementation that uses the TimestampType metadata of the input
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
>
> >
> > If not - these are the options I've came up with:
> >
> > Let's assume Decimal128Type value is  n
> >
> > Then I expect that base call
> > .cast('int64') will return  overflow for n beyond int64 values, value 
> > otherwise
> >
> > Option 1:
> >
> > .cast('int64', decimal_scale=s)  would calculate  n/10**s and return
> > overflow if it is beyond int64, value otherwise
> >
> > Option 2:
> >
> > .cast('int64', bytes_group=0) would return n & 0x
> > .cast('int64', bytes_group=1) would return (n >> 64) & 0x
> > .cast('int64') would have default value bytes_group=0
> >
> > Option 3:
> >
> > cast has no CastOptions but we add  multiply compute kernel and have
> > something like this instead:
> >
> > .compute('multiply', 10**-s).cast('int64')
> >
> > BR,
> >
> > Jacek


[ARROW-3329] Re: Decimal casting or scaling

2020-02-12 Thread Jacek Pliszka
Hi!

ARROW-3329 - we can discuss there.

> It seems like it makes sense to implement both lossless safe casts
> (when all zeros after the decimal point) and lossy casts (fractional
> part discarded) from decimal to integer, do I have that right?

Yes, though if I understood correctly, your examples are the same case: in both
cases the fractional part is discarded, it just happens to be all 0s in the
first case.

The key question is whether CastFunctor in cast.cc has access to the scale
of the decimal. If yes, how?

If not - these are the options I've came up with:

Let's assume Decimal128Type value is  n

Then I expect that the base call
.cast('int64') will return overflow for n beyond int64 values, and the value otherwise

Option 1:

.cast('int64', decimal_scale=s)  would calculate  n/10**s and return
overflow if it is beyond int64, value otherwise

Option 2:

.cast('int64', bytes_group=0) would return n & 0x
.cast('int64', bytes_group=1) would return (n >> 64) & 0x
.cast('int64') would have default value bytes_group=0

Option 3:

cast has no CastOptions, but we add a multiply compute kernel and have
something like this instead:

.compute('multiply', 10**-s).cast('int64')

BR,

Jacek


Decimal casting or scaling

2020-02-12 Thread Jacek Pliszka
Hi!

I am interested in having cast from Decimal to Int in pyarrow.

I have a couple of ideas, but I am a newbie so I might be wrong:

Do I understand correctly that the problem lies in the fact that
CastFunctor knows nothing about decimal scale?

Were there any ideas how to handle this properly?

My ideas are not that great but maybe one of them would be OK:

1. We can pass 'numeric_scale_shift' or 'decimal_scale_shift' in CastOptions.
Then, while casting, the numbers would be scaled properly.

2. Pass a byte group selector in CastOptions, i.e. when casting from
N*M bytes to N bytes we can pick any of the M groups.

3. Do not modify cast, but add a scale/multiply compute kernel so we can
apply it explicitly prior to casting.

What do you think?  I like 1 the most and 2 the least.

Would any of this solutions be accepted into the code?

I do not want to work on something that would be rejected immediately...

Thanks for any input provided,

Jacek


Re: Help with decimal ArrayData needed

2020-02-11 Thread Jacek Pliszka
Issue solved, sorry for the noise: is_valid was false for the 2nd value.

BR.

Jacek

Tue, 11 Feb 2020 at 21:43 Jacek Pliszka wrote:

> Hi!
>
> I am an arrow newbie trying to implement cast from Decimal128 to Int64 and
> I need some help.
>
> In unit test I am passing
> std::vector v
> to
> CheckCase(decimal(38, 10),
> v, is_valid, int64(), e,
> options);
>
> But when I try to read it in
> CastFunctor
> with
> auto ptr = input.GetValues(1) where in_type is Decimal128
>
> then ptr[0] shows correctly v[0]
> but ptr[1] shows 0 instead of v[1] = 30
>
> What am I doing wrong?
>
> The task seems simple enough for me to finish once I solve this and find
> the proper scale information.
>
> Thanks for any help,
>
> Jacek
>
>
>
>


Help with decimal ArrayData needed

2020-02-11 Thread Jacek Pliszka
Hi!

I am an arrow newbie trying to implement cast from Decimal128 to Int64 and
I need some help.

In unit test I am passing
std::vector v
to
CheckCase(decimal(38, 10),
v, is_valid, int64(), e,
options);

But when I try to read it in
CastFunctor
with
auto ptr = input.GetValues(1) where in_type is Decimal128

then ptr[0] shows correctly v[0]
but ptr[1] shows 0 instead of v[1] = 30

What am I doing wrong?

The task seems simple enough for me to finish once I solve this and find
the proper scale information.

Thanks for any help,

Jacek