[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1425:
    Fix Version/s: (was: 0.8.0)
                   0.9.0

> [Python] Document semantic differences between Spark timestamps and Arrow timestamps
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Heimir Thor Sverrisson
> Labels: pull-request-available
> Fix For: 0.9.0
>
> The way that Spark treats non-timezone-aware timestamps as session local can
> be problematic when using pyarrow, which may view the data coming from
> toPandas() as time zone naive (but with fields as though it were UTC, not
> session local). We should document carefully how to properly handle the data
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
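[Editorial sketch] The pitfall described in ARROW-1425 can be illustrated with standard-library datetimes alone; this is not the pyarrow or Spark API, and the session zone below is a hypothetical example. The point: a naive value carrying session-local wall-clock fields names a different instant when those fields are read back as UTC.

```python
# Illustrative sketch only (stdlib datetime, not pyarrow/Spark API).
# Spark renders a session-local timestamp's wall-clock *fields* into a
# naive value; interpreting those fields as UTC picks a different instant.
from datetime import datetime, timezone, timedelta

session_tz = timezone(timedelta(hours=-5))     # hypothetical Spark session zone
naive = datetime(2017, 10, 25, 12, 0, 0)       # naive value as seen via toPandas()

as_utc = naive.replace(tzinfo=timezone.utc)    # wrong: the fields were session-local
as_session = naive.replace(tzinfo=session_tz)  # correct interpretation

offset = as_utc - as_session                   # the two instants differ by the session offset
```

Documenting which of the two interpretations applies is exactly what this issue asks for.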
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219917#comment-16219917 ]

Wes McKinney commented on ARROW-1425:
    It seems there is still too much in flux on the Spark side. Moving this to the next milestone.
[jira] [Commented] (ARROW-1555) [Python] write_to_dataset on s3
[ https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219913#comment-16219913 ]

ASF GitHub Bot commented on ARROW-1555:
    wesm commented on issue #1240: ARROW-1555 [Python] Implement Dask exists function
    URL: https://github.com/apache/arrow/pull/1240#issuecomment-339542169

    Build looks OK; the failure is unrelated (I restarted the failing job anyway so we can get a green build).

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] write_to_dataset on s3
>
> Key: ARROW-1555
> URL: https://issues.apache.org/jira/browse/ARROW-1555
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Young-Jun Ko
> Assignee: Florian Jetter
> Priority: Trivial
> Labels: pull-request-available
> Fix For: 0.8.0
>
> When writing an Arrow table to S3, I get a NotImplementedError.
> The root cause is in _ensure_filesystem and can be reproduced as follows:
>
> import pyarrow
> import pyarrow.parquet as pqa
> import s3fs
>
> s3 = s3fs.S3FileSystem()
> pqa._ensure_filesystem(s3).exists("anything")
>
> It appears that the S3FSWrapper instantiated in _ensure_filesystem
> does not expose the exists method of s3.
[jira] [Commented] (ARROW-1555) [Python] write_to_dataset on s3
[ https://issues.apache.org/jira/browse/ARROW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219911#comment-16219911 ]

ASF GitHub Bot commented on ARROW-1555:
    wesm commented on a change in pull request #1240: ARROW-1555 [Python] Implement Dask exists function
    URL: https://github.com/apache/arrow/pull/1240#discussion_r147039732

    File path: python/pyarrow/filesystem.py

    @@ -135,6 +135,12 @@ def isfile(self, path):
             """
             raise NotImplementedError

    +    def isfilestore(self):

    Review comment: Can you make this a private API (`_isfilestore`)? Unclear if normal users would need this.
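[Editorial sketch] The root-cause pattern in ARROW-1555, a wrapper that inherits a base class's NotImplementedError stub instead of forwarding the call to the wrapped filesystem, can be sketched without s3fs. The class names below are hypothetical stand-ins, not pyarrow's actual classes.

```python
# Hypothetical stand-ins for pyarrow's filesystem base class and S3FSWrapper.
# The bug: a wrapper that does not override exists() inherits the
# NotImplementedError stub; the fix is to delegate to the wrapped filesystem.
class FileSystemBase:
    def exists(self, path):
        raise NotImplementedError


class FakeS3FS:  # stands in for s3fs.S3FileSystem
    def __init__(self, keys):
        self._keys = set(keys)

    def exists(self, path):
        return path in self._keys


class Wrapper(FileSystemBase):
    def __init__(self, fs):
        self.fs = fs

    def exists(self, path):
        # Delegate explicitly to the wrapped filesystem.
        return self.fs.exists(path)


wrapped = Wrapper(FakeS3FS(["bucket/key"]))
```

Without the `exists` override on `Wrapper`, the call would raise NotImplementedError exactly as in the report.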
[jira] [Updated] (ARROW-1133) [C++] Convert all non-accessor function names to PascalCase
[ https://issues.apache.org/jira/browse/ARROW-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1133:
    Fix Version/s: (was: 0.8.0)
                   1.0.0

> [C++] Convert all non-accessor function names to PascalCase
>
> Key: ARROW-1133
> URL: https://issues.apache.org/jira/browse/ARROW-1133
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Fix For: 1.0.0
>
> It seems Google has taken the "cheap functions can be lower case" rule out of
> their style guide. I've been asked enough about "which style to use" that I
> like the idea of UsePascalCaseForEverything.
> https://github.com/google/styleguide/commit/db0a26320f3e930c6ea7225ed53539b4fb31310c#diff-26120df7bca3279afbf749017c778545R4277
[jira] [Commented] (ARROW-1728) [C++] Run clang-format checks in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219901#comment-16219901 ]

ASF GitHub Bot commented on ARROW-1728:
    wesm commented on issue #1251: ARROW-1728: [C++] Run clang-format checks in Travis CI
    URL: https://github.com/apache/arrow/pull/1251#issuecomment-339540207

    Alright, we are looking good:

    ```
    Scanning dependencies of target check-format
    clang-format checks failed, run 'make format' to fix
    make[3]: *** [CMakeFiles/check-format] Error 255
    make[2]: *** [CMakeFiles/check-format.dir/all] Error 2
    make[1]: *** [CMakeFiles/check-format.dir/rule] Error 2
    make: *** [check-format] Error 2
    ```

    I'll revert the flake, and if others are in agreement about failing on clang-format issues we can merge this.

> [C++] Run clang-format checks in Travis CI
>
> Key: ARROW-1728
> URL: https://issues.apache.org/jira/browse/ARROW-1728
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
> I think it's reasonable to expect contributors to run clang-format on their
> code. This may lead to a higher number of failed builds but will eliminate
> noise diffs in unrelated patches.
[jira] [Updated] (ARROW-1646) [Python] pyarrow.array cannot handle NumPy scalar types
[ https://issues.apache.org/jira/browse/ARROW-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1646:
    Fix Version/s: (was: 0.8.0)
                   0.9.0

> [Python] pyarrow.array cannot handle NumPy scalar types
>
> Key: ARROW-1646
> URL: https://issues.apache.org/jira/browse/ARROW-1646
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.9.0
>
> Example repro:
> {code}
> In [1]: import pyarrow as pa
> In [2]: import numpy as np
> In [3]: pa.array([np.random.randint(0, 10, size=5), None])
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-3> in <module>()
> ----> 1 pa.array([np.random.randint(0, 10, size=5), None])
>
> /home/wesm/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:24892)()
>     171         if mask is not None:
>     172             raise ValueError("Masks only supported with ndarray-like inputs")
> --> 173         return _sequence_to_array(obj, size, type, pool)
>     174
>     175
>
> /home/wesm/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:23496)()
>      23     if type is None:
>      24         with nogil:
> ---> 25             check_status(ConvertPySequence(sequence, pool, &out))
>      26     else:
>      27         if size is None:
>
> /home/wesm/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:7876)()
>      75         message = frombytes(status.message())
>      76         if status.IsInvalid():
> ---> 77             raise ArrowInvalid(message)
>      78         elif status.IsIOError():
>      79             raise ArrowIOError(message)
>
> ArrowInvalid:
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:740 code: InferArrowTypeAndSize(obj, &size, &type)
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:319 code: InferArrowType(obj, out_type)
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:299 code: seq_visitor.Visit(obj)
> /home/wesm/code/arrow/cpp/src/arrow/python/builtin_convert.cc:180 code: VisitElem(ref, level)
> Error inferring Arrow data type for collection of Python objects. Got Python
> object of type ndarray but can only handle these types: bool, float, integer,
> date, datetime, bytes, unicode
> {code}
> If these inner values are converted to Python built-in int types, then it
> works fine.
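[Editorial sketch] As the issue notes, conversion works once the inner values are Python built-ins. A user-side normalizer can be sketched with duck typing on the `.item()`/`.tolist()` protocol that NumPy scalars and arrays expose; `FakeScalar` below is a stand-in for a NumPy scalar such as `np.int64(7)` so the sketch runs without NumPy.

```python
# Sketch: normalize NumPy-like scalars/arrays to Python built-ins so that
# type inference only ever sees bool/int/float/str/etc.
def to_builtin(obj):
    if hasattr(obj, "tolist"):   # ndarray-like: expand to a Python list
        return obj.tolist()
    if hasattr(obj, "item"):     # NumPy scalar-like: unwrap to a built-in
        return obj.item()
    return obj


class FakeScalar:  # stands in for np.int64(7)
    def item(self):
        return 7


normalized = [to_builtin(x) for x in [FakeScalar(), None, 3]]
```

This is only a workaround sketch; the issue itself tracks teaching pyarrow's inference to accept these objects directly.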
[jira] [Commented] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
[ https://issues.apache.org/jira/browse/ARROW-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219886#comment-16219886 ]

ASF GitHub Bot commented on ARROW-1732:
    wesm opened a new pull request #1252: ARROW-1732: [Python] Permit creating record batches with no columns, test pandas roundtrips
    URL: https://github.com/apache/arrow/pull/1252

    I ran into this rough edge today. Invariably, serialization code paths will need to send across a DataFrame with no columns; this will need to work even if `preserve_index=False`.

> [Python] RecordBatch.from_pandas fails on DataFrame with no columns when
> preserve_index=False
>
> Key: ARROW-1732
> URL: https://issues.apache.org/jira/browse/ARROW-1732
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
> I believe this should have well-defined behavior and not raise an error:
> {code}
> In [5]: pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-5> in <module>()
> ----> 1 pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
>
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_pandas
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:39957)()
>     586             df, schema, preserve_index, nthreads=nthreads
>     587         )
> --> 588         return cls.from_arrays(arrays, names, metadata)
>     589
>     590     @staticmethod
>
> ~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays
> (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:40130)()
>     615
>     616         if not number_of_arrays:
> --> 617             raise ValueError('Record batch cannot contain no arrays (for now)')
>     618
>     619         num_rows = len(arrays[0])
>
> ValueError: Record batch cannot contain no arrays (for now)
> {code}
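[Editorial sketch] Until a fix like PR #1252 lands, callers can guard the zero-column case themselves, which is the same length check the Feather writer used as a workaround. The callables below are hypothetical stand-ins for the conversion and write steps.

```python
# Sketch: skip conversion entirely when the frame has no columns, mirroring
# the `if len(df.columns) > 0:` guard used in the Feather write path.
def write_columns(columns, convert, write_array):
    if len(columns) == 0:
        return 0  # an empty frame writes no arrays
    written = 0
    for name in columns:
        write_array(name, convert(name))
        written += 1
    return written


out = []
n = write_columns(["a", "b"], str.upper, lambda name, col: out.append((name, col)))
```

The proper fix is for `RecordBatch.from_arrays` to accept an empty list, which is what the PR proposes.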
[jira] [Updated] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
[ https://issues.apache.org/jira/browse/ARROW-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-1732:
    Labels: pull-request-available (was: )
[jira] [Assigned] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
[ https://issues.apache.org/jira/browse/ARROW-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-1732:
    Assignee: Wes McKinney
[jira] [Updated] (ARROW-842) [Python] Handle more kinds of null sentinel objects from pandas 0.x
[ https://issues.apache.org/jira/browse/ARROW-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-842:
    Fix Version/s: (was: 0.8.0)
                   0.9.0

> [Python] Handle more kinds of null sentinel objects from pandas 0.x
>
> Key: ARROW-842
> URL: https://issues.apache.org/jira/browse/ARROW-842
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Fix For: 0.9.0
>
> Follow-on work to ARROW-707. See
> https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L193
> and discussion in https://github.com/apache/arrow/pull/554
[jira] [Commented] (ARROW-842) [Python] Handle more kinds of null sentinel objects from pandas 0.x
[ https://issues.apache.org/jira/browse/ARROW-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219873#comment-16219873 ]

Wes McKinney commented on ARROW-842:
    This might wait for more general tooling around NumPy scalar types. See also ARROW-1646.
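[Editorial sketch] The kinds of sentinels ARROW-842 is about can be recognized with a small predicate. This pure-Python sketch mirrors the spirit of the checks in pandas' `lib.pyx` and Arrow's `PandasObjectIsNull`, but the function name and exact rules here are illustrative, not the library's actual implementation.

```python
import math


# Sketch: classify common pandas null sentinels. None and float NaN are the
# easy cases; NaN-like objects (e.g. Decimal('NaN')) can be caught by the
# self-inequality property of NaN values.
def is_null_sentinel(obj):
    if obj is None:
        return True
    if isinstance(obj, float):
        return math.isnan(obj)
    try:
        # NaN-like objects compare unequal to themselves.
        return obj != obj
    except Exception:
        return False
```

A full implementation would also have to handle NumPy scalar NaN/NaT types, which is why the comment above ties this to more general NumPy scalar tooling.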
[jira] [Resolved] (ARROW-1524) [C++] More graceful solution for handling non-zero offsets on inputs and outputs in compute library
[ https://issues.apache.org/jira/browse/ARROW-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1524.
    Resolution: Fixed

Resolved in https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [C++] More graceful solution for handling non-zero offsets on inputs and
> outputs in compute library
>
> Key: ARROW-1524
> URL: https://issues.apache.org/jira/browse/ARROW-1524
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.8.0
>
> Currently we must remember to shift by the offset. We should add some inline
> utility functions to centralize this logic.
[jira] [Resolved] (ARROW-1482) [C++] Implement casts between date32 and date64
[ https://issues.apache.org/jira/browse/ARROW-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1482.
    Resolution: Fixed

Resolved in https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [C++] Implement casts between date32 and date64
>
> Key: ARROW-1482
> URL: https://issues.apache.org/jira/browse/ARROW-1482
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.8.0
[jira] [Resolved] (ARROW-1672) [Python] Failure to write Feather bytes column
[ https://issues.apache.org/jira/browse/ARROW-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1672.
    Resolution: Fixed

Resolved in https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855

> [Python] Failure to write Feather bytes column
>
> Key: ARROW-1672
> URL: https://issues.apache.org/jira/browse/ARROW-1672
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.8.0
>
> See bug report in https://github.com/wesm/feather/issues/320
[jira] [Resolved] (ARROW-1680) [Python] Timestamp unit change not done in from_pandas() conversion
[ https://issues.apache.org/jira/browse/ARROW-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1680.
    Resolution: Fixed

Resolved in https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c

> [Python] Timestamp unit change not done in from_pandas() conversion
>
> Key: ARROW-1680
> URL: https://issues.apache.org/jira/browse/ARROW-1680
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Bryan Cutler
> Assignee: Wes McKinney
> Fix For: 0.8.0
>
> Calling {{Array.from_pandas}} with a pandas.Series of timestamps that
> have 'ns' unit while specifying a type to coerce to with 'us' causes problems.
> When the series has timestamps with a timezone, the unit is ignored. When
> the series does not have a timezone, the unit is applied but causes an
> OverflowError when printing.
> {noformat}
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> from datetime import datetime
> >>> s = pd.Series([datetime.now()])
> >>> s_nyc = s.dt.tz_localize('tzlocal()').dt.tz_convert('America/New_York')
> >>> arr = pa.Array.from_pandas(s_nyc, type=pa.timestamp('us', tz='America/New_York'))
> >>> arr.type
> TimestampType(timestamp[ns, tz=America/New_York])
> >>> arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
> >>> arr.type
> TimestampType(timestamp[us])
> >>> print(arr)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__
>     (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
>     values = array_format(self, window=10)
>   File "pyarrow/formatting.py", line 28, in array_format
>     values.append(value_format(x, 0))
>   File "pyarrow/formatting.py", line 49, in value_format
>     return repr(x)
>   File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__
>     (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
>     return repr(self.as_py())
>   File "pyarrow/scalar.pxi", line 240, in pyarrow.lib.TimestampValue.as_py
>     (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:21600)
>     return converter(value, tzinfo=tzinfo)
>   File "pyarrow/scalar.pxi", line 204, in pyarrow.lib.lambda5
>     (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:7295)
>     TimeUnit_MICRO: lambda x, tzinfo: pd.Timestamp(
>   File "pandas/_libs/tslib.pyx", line 402, in pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)
>   File "pandas/_libs/tslib.pyx", line 1467, in pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:27665)
> OverflowError: Python int too large to convert to C long
> {noformat}
> A workaround is to manually change values with astype:
> {noformat}
> >>> arr = pa.Array.from_pandas(s.values.astype('datetime64[us]'))
> >>> arr.type
> TimestampType(timestamp[us])
> >>> print(arr)
> [
>   Timestamp('2017-10-17 11:04:44.308233')
> ]
> {noformat}
[jira] [Resolved] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write
[ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1675.
    Resolution: Fixed

Resolved by PR https://github.com/apache/arrow/commit/238881fae8530a1ae994eb0e283e4783d3dd2855

> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
>
> Key: ARROW-1675
> URL: https://issues.apache.org/jira/browse/ARROW-1675
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
> In addition to making the implementation simpler, we will also benefit from
> multithreaded conversions, so faster write speeds.
[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write
[ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219863#comment-16219863 ]

ASF GitHub Bot commented on ARROW-1675:
    wesm closed pull request #1250: ARROW-1675: [Python] Use RecordBatch.from_pandas in Feather write path
    URL: https://github.com/apache/arrow/pull/1250

    This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py
index 2091c9154..3ba9d652c 100644
--- a/python/pyarrow/feather.py
+++ b/python/pyarrow/feather.py
@@ -23,7 +23,7 @@
 from pyarrow.compat import pdapi
 from pyarrow.lib import FeatherError  # noqa
-from pyarrow.lib import Table
+from pyarrow.lib import RecordBatch, Table
 import pyarrow.lib as ext

 try:
@@ -75,30 +75,12 @@ def write(self, df):
         if not df.columns.is_unique:
             raise ValueError("cannot serialize duplicate column names")

-        # TODO(wesm): pipeline conversion to Arrow memory layout
-        for i, name in enumerate(df.columns):
-            col = df.iloc[:, i]
-
-            if pdapi.is_object_dtype(col):
-                inferred_type = infer_dtype(col)
-                msg = ("cannot serialize column {n} "
-                       "named {name} with dtype {dtype}".format(
-                           n=i, name=name, dtype=inferred_type))
-
-                if inferred_type in ['mixed']:
-
-                    # allow columns with nulls + an inferable type
-                    inferred_type = infer_dtype(col[col.notnull()])
-                    if inferred_type in ['mixed']:
-                        raise ValueError(msg)
-
-                elif inferred_type not in ['unicode', 'string']:
-                    raise ValueError(msg)
-
-            if not isinstance(name, six.string_types):
-                name = str(name)
-
-            self.writer.write_array(name, col)
+        # TODO(wesm): Remove this length check, see ARROW-1732
+        if len(df.columns) > 0:
+            batch = RecordBatch.from_pandas(df, preserve_index=False)
+            for i, name in enumerate(batch.schema.names):
+                col = batch[i]
+                self.writer.write_array(name, col)

         self.writer.close()
diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py
index 810ee3c8c..9e7fc8863 100644
--- a/python/pyarrow/tests/test_feather.py
+++ b/python/pyarrow/tests/test_feather.py
@@ -279,11 +279,14 @@ def test_delete_partial_file_on_error(self):
         if sys.platform == 'win32':
             pytest.skip('Windows hangs on to file handle for some reason')

+        class CustomClass(object):
+            pass
+
         # strings will fail
         df = pd.DataFrame(
             {
                 'numbers': range(5),
-                'strings': [b'foo', None, u'bar', 'qux', np.nan]},
+                'strings': [b'foo', None, u'bar', CustomClass(), np.nan]},
             columns=['numbers', 'strings'])

         path = random_path()
@@ -297,10 +300,13 @@ def test_delete_partial_file_on_error(self):
     def test_strings(self):
         repeats = 1000

-        # we hvae mixed bytes, unicode, strings
+        # Mixed bytes, unicode, strings coerced to binary
         values = [b'foo', None, u'bar', 'qux', np.nan]
         df = pd.DataFrame({'strings': values * repeats})
-        self._assert_error_on_write(df, ValueError)
+
+        ex_values = [b'foo', None, b'bar', b'qux', np.nan]
+        expected = pd.DataFrame({'strings': ex_values * repeats})
+        self._check_pandas_roundtrip(df, expected, null_counts=[2 * repeats])

         # embedded nulls are ok
         values = ['foo', None, 'bar', 'qux', None]
diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index 686e56ead..c9a490960 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -662,7 +662,6 @@ cdef _as_type(type):
     return type_for_alias(type)

-
 cdef set PRIMITIVE_TYPES = set([
     _Type_NA, _Type_BOOL, _Type_UINT8, _Type_INT8,
[jira] [Resolved] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1721.
    Resolution: Fixed

Resolved by PR https://github.com/apache/arrow/commit/48a6ff856cf4de939f5ced42a09b1b39866efc1e

> [Python] Support null mask in places where it isn't supported in
> numpy_to_arrow.cc
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Licht Takeuchi
> Labels: pull-request-available
> Fix For: 0.8.0
>
> See https://github.com/apache/spark/pull/18664#discussion_r146472109 for
> SPARK-21375
[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write
[ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219860#comment-16219860 ]

ASF GitHub Bot commented on ARROW-1675:
    wesm commented on issue #1250: ARROW-1675: [Python] Use RecordBatch.from_pandas in Feather write path
    URL: https://github.com/apache/arrow/pull/1250#issuecomment-339530622

    +1
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219857#comment-16219857 ] ASF GitHub Bot commented on ARROW-1721: --- wesm closed pull request #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc index 2c89a9f61..ead3a0481 100644 --- a/cpp/src/arrow/python/numpy_to_arrow.cc +++ b/cpp/src/arrow/python/numpy_to_arrow.cc @@ -622,8 +622,12 @@ Status NumPyConverter::ConvertDates() { Ndarray1DIndexer objects(arr_); + Ndarray1DIndexer mask_values; + + bool have_mask = false; if (mask_ != nullptr) { -return Status::NotImplemented("mask not supported in object conversions yet"); +mask_values.Init(mask_); +have_mask = true; } BuilderType builder(pool_); @@ -636,10 +640,10 @@ Status NumPyConverter::ConvertDates() { PyObject* obj; for (int64_t i = 0; i < length_; ++i) { obj = objects[i]; -if (PyDate_CheckExact(obj)) { - RETURN_NOT_OK(builder.Append(UnboxDate::Unbox(obj))); -} else if (PandasObjectIsNull(obj)) { +if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) { RETURN_NOT_OK(builder.AppendNull()); +} else if (PyDate_CheckExact(obj)) { + RETURN_NOT_OK(builder.Append(UnboxDate::Unbox(obj))); } else { std::stringstream ss; ss << "Error converting from Python objects to Date: "; @@ -1029,6 +1033,41 @@ Status LoopPySequence(PyObject* sequence, T func) { return Status::OK(); } +template +Status LoopPySequenceWithMasks(PyObject* sequence, + const Ndarray1DIndexer& mask_values, + bool have_mask, T func) { + if 
(PySequence_Check(sequence)) { +OwnedRef ref; +Py_ssize_t size = PySequence_Size(sequence); +if (PyArray_Check(sequence)) { + auto array = reinterpret_cast(sequence); + Ndarray1DIndexer objects(array); + for (int64_t i = 0; i < size; ++i) { +RETURN_NOT_OK(func(objects[i], have_mask && mask_values[i])); + } +} else { + for (int64_t i = 0; i < size; ++i) { +ref.reset(PySequence_GetItem(sequence, i)); +RETURN_NOT_OK(func(ref.obj(), have_mask && mask_values[i])); + } +} + } else if (PyObject_HasAttrString(sequence, "__iter__")) { +OwnedRef iter = OwnedRef(PyObject_GetIter(sequence)); +PyObject* item; +int64_t i = 0; +while ((item = PyIter_Next(iter.obj( { + OwnedRef ref = OwnedRef(item); + RETURN_NOT_OK(func(ref.obj(), have_mask && mask_values[i])); + i++; +} + } else { +return Status::TypeError("Object is not a sequence or iterable"); + } + + return Status::OK(); +} + template inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr& type, ListBuilder* builder, PyObject* list) { @@ -1037,15 +1076,18 @@ inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr& PyAcquireGIL lock; - // TODO: mask not supported here + Ndarray1DIndexer mask_values; + + bool have_mask = false; if (mask_ != nullptr) { -return Status::NotImplemented("mask not supported in object conversions yet"); +mask_values.Init(mask_); +have_mask = true; } BuilderT* value_builder = static_cast(builder->value_builder()); - auto foreach_item = [&](PyObject* object) { -if (PandasObjectIsNull(object)) { + auto foreach_item = [&](PyObject* object, bool mask) { +if (mask || PandasObjectIsNull(object)) { return builder->AppendNull(); } else if (PyArray_Check(object)) { auto numpy_array = reinterpret_cast(object); @@ -1071,7 +1113,7 @@ inline Status NumPyConverter::ConvertTypedLists(const std::shared_ptr& } }; - return LoopPySequence(list, foreach_item); + return LoopPySequenceWithMasks(list, mask_values, have_mask, foreach_item); } template <> @@ -1079,15 +1121,18 @@ inline Status 
NumPyConverter::ConvertTypedLists( const std::shared_ptr& type, ListBuilder* builder, PyObject* list) { PyAcquireGIL lock; - // TODO: mask not supported here + Ndarray1DIndexer mask_values; + + bool have_mask = false; if (mask_ != nullptr) { -return Status::NotImplemented("mask not supported in object conversions yet"); +mask_values.Init(mask_); +have_mask = true; } auto value_builder = static_cast(builder->value_builder()); - auto foreach_item = [&](PyObject* object) { -if (PandasObjectIsN
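The patch above makes an explicit null mask take precedence over per-object checks in the date conversion path: an entry is null if the caller-supplied mask says so or the object itself is a null sentinel, and only then is the value type-checked. A rough pure-Python sketch of that precedence rule (the helper names here are illustrative, not pyarrow's actual API):

```python
import datetime
import math

def is_pandas_null(obj):
    # Simplified stand-in for PandasObjectIsNull (None or NaN).
    return obj is None or (isinstance(obj, float) and math.isnan(obj))

def convert_dates(objects, mask=None):
    # Mirrors the merged logic: mask check comes before the date check,
    # so a masked-out entry never has to be a valid date object.
    have_mask = mask is not None
    out = []
    for i, obj in enumerate(objects):
        if (have_mask and mask[i]) or is_pandas_null(obj):
            out.append(None)                     # builder.AppendNull()
        elif isinstance(obj, datetime.date):
            out.append(obj.toordinal())          # unbox the date value
        else:
            raise TypeError("Error converting from Python objects to Date")
    return out

values = [datetime.date(2017, 10, 25), datetime.date(2017, 10, 26)]
print(convert_dates(values, mask=[False, True]))  # second entry masked to None
```

Note that with the pre-patch ordering (type check first), a masked slot holding a non-date object would have raised instead of becoming null.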
[jira] [Resolved] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units
[ https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1484. - Resolution: Fixed Resolved by PR https://github.com/apache/arrow/commit/54d5c81af0a9cbc6ea551922c795728cd43bd86c > [C++] Implement (safe and unsafe) casts between timestamps and times of > different units > --- > > Key: ARROW-1484 > URL: https://issues.apache.org/jira/browse/ARROW-1484 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units
[ https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219848#comment-16219848 ] ASF GitHub Bot commented on ARROW-1484: --- wesm closed pull request #1245: ARROW-1484: [C++/Python] Implement casts between date, time, timestamp units URL: https://github.com/apache/arrow/pull/1245 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/compute/cast.cc b/cpp/src/arrow/compute/cast.cc index e8bbfd347..68a2b1237 100644 --- a/cpp/src/arrow/compute/cast.cc +++ b/cpp/src/arrow/compute/cast.cc @@ -25,6 +25,7 @@ #include #include #include +#include #include "arrow/array.h" #include "arrow/buffer.h" @@ -68,6 +69,24 @@ namespace arrow { namespace compute { +template +inline const T* GetValuesAs(const ArrayData& data, int i) { + return reinterpret_cast(data.buffers[i]->data()) + data.offset; +} + +namespace { + +void CopyData(const Array& input, ArrayData* output) { + auto in_data = input.data(); + output->length = in_data->length; + output->null_count = input.null_count(); + output->buffers = in_data->buffers; + output->offset = in_data->offset; + output->child_data = in_data->child_data; +} + +} // namespace + // -- // Zero copy casts @@ -77,7 +96,9 @@ struct is_zero_copy_cast { }; template -struct is_zero_copy_cast::value>::type> { +struct is_zero_copy_cast< +O, I, typename std::enable_if::value && + !std::is_base_of::value>::type> { static constexpr bool value = true; }; @@ -102,10 +123,7 @@ template struct CastFunctor::value>::type> { void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, ArrayData* output) { -auto in_data = input.data(); -output->null_count = input.null_count(); -output->buffers = in_data->buffers; 
-output->child_data = in_data->child_data; +CopyData(input, output); } }; @@ -119,6 +137,7 @@ struct CastFunctorbuffers[1]; +DCHECK_EQ(output->offset, 0); memset(buf->mutable_data(), 0, buf->size()); } }; @@ -139,12 +158,16 @@ struct CastFunctorbuffers[1]->data(); -auto out = reinterpret_cast(output->buffers[1]->mutable_data()); constexpr auto kOne = static_cast(1); constexpr auto kZero = static_cast(0); + +auto in_data = input.data(); +internal::BitmapReader bit_reader(in_data->buffers[1]->data(), in_data->offset, + in_data->length); +auto out = reinterpret_cast(output->buffers[1]->mutable_data()); for (int64_t i = 0; i < input.length(); ++i) { - *out++ = BitUtil::GetBit(data, i) ? kOne : kZero; + *out++ = bit_reader.IsSet() ? kOne : kZero; + bit_reader.Next(); } } }; @@ -189,7 +212,9 @@ struct CastFunctor::v void operator()(FunctionContext* ctx, const CastOptions& options, const Array& input, ArrayData* output) { using in_type = typename I::c_type; -auto in_data = reinterpret_cast(input.data()->buffers[1]->data()); +DCHECK_EQ(output->offset, 0); + +const in_type* in_data = GetValuesAs(*input.data(), 1); uint8_t* out_data = reinterpret_cast(output->buffers[1]->mutable_data()); for (int64_t i = 0; i < input.length(); ++i) { BitUtil::SetBitTo(out_data, i, (*in_data++) != 0); @@ -204,12 +229,11 @@ struct CastFunctoroffset, 0); auto in_offset = input.offset(); -const auto& input_buffers = input.data()->buffers; - -auto in_data = reinterpret_cast(input_buffers[1]->data()) + in_offset; +const in_type* in_data = GetValuesAs(*input.data(), 1); auto out_data = reinterpret_cast(output->buffers[1]->mutable_data()); if (!options.allow_int_overflow) { @@ -217,14 +241,15 @@ struct CastFunctor(std::numeric_limits::min()); if (input.null_count() > 0) { -const uint8_t* is_valid = input_buffers[0]->data(); -int64_t is_valid_offset = in_offset; +internal::BitmapReader is_valid_reader(input.data()->buffers[0]->data(), + in_offset, input.length()); for (int64_t i = 0; i < 
input.length(); ++i) { - if (ARROW_PREDICT_FALSE(BitUtil::GetBit(is_valid, is_valid_offset++) && + if (ARROW_PREDICT_FALSE(is_valid_reader.IsSet() && (*in_data > kMax || *in_data < kMin))) { ctx->SetStatus(Status::Invalid("Integer value out of bounds")); } *out_data++ = static_cast(*in_data++); + is_valid_reader.Next(); } } else { for (int64_t i = 0; i < i
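The cast kernels merged for ARROW-1484 must distinguish safe from unsafe conversions when changing time units: upcasting (e.g. ms to us) is always exact, while downcasting may truncate and a safe cast must refuse values that would lose precision. A pure-Python sketch of that rule (the unit table and function name are illustrative, not the C++ API):

```python
# Ticks per second for each supported timestamp/time unit.
UNIT_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

def cast_timestamp(value, from_unit, to_unit, safe=True):
    num = UNIT_PER_SECOND[to_unit]
    den = UNIT_PER_SECOND[from_unit]
    if num >= den:                   # upcast: multiply, always exact
        return value * (num // den)
    factor = den // num              # downcast: may truncate
    if safe and value % factor != 0:
        raise ValueError("Casting would lose data: %d %s -> %s"
                         % (value, from_unit, to_unit))
    return value // factor

print(cast_timestamp(1500, "ms", "us"))      # 1500000
print(cast_timestamp(1500000, "us", "ms"))   # 1500
```

In current pyarrow the corresponding entry point is `Array.cast(target_type, safe=...)`, which applies the same idea per value.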
[jira] [Commented] (ARROW-587) Add JIRA fix version to merge tool
[ https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219850#comment-16219850 ] ASF GitHub Bot commented on ARROW-587: -- wesm commented on issue #1248: ARROW-587: Add fix version to PR merge tool URL: https://github.com/apache/arrow/pull/1248#issuecomment-339528271 Tried merging #1245 but there was a bug. Will keep at it until this script is right This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add JIRA fix version to merge tool > -- > > Key: ARROW-587 > URL: https://issues.apache.org/jira/browse/ARROW-587 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > Like parquet-mr's tool. This will make releases less painful -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219844#comment-16219844 ] ASF GitHub Bot commented on ARROW-1425: --- wesm commented on issue #1095: ARROW-1425 [Python] Document semantic differences between Spark and Arrow timestamps URL: https://github.com/apache/arrow/pull/1095#issuecomment-339527887 @icexelloss @heimir-sverrisson it may make sense to engage in https://github.com/apache/spark/pull/18664 and at least try to process the discussion that is going on around time zones. This is some very thorny stuff and I don't have the bandwidth right this moment to properly engage with this This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson > Labels: pull-request-available > Fix For: 0.8.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1425: -- Labels: pull-request-available (was: ) > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson > Labels: pull-request-available > Fix For: 0.8.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
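The gap the requested documentation needs to explain can be shown with the stdlib alone: a naive timestamp coming out of toPandas() denotes different instants depending on whether its fields are read as session-local or as UTC. The session timezone below is an assumed example, not something Spark or Arrow reports:

```python
from datetime import datetime, timezone, timedelta

session_tz = timezone(timedelta(hours=-7))   # e.g. a UTC-7 Spark session

naive = datetime(2017, 10, 25, 12, 0, 0)     # tz-naive value from toPandas()

# Wrong: reading the naive fields as UTC shifts the instant by 7 hours.
as_utc = naive.replace(tzinfo=timezone.utc)

# Right: attach the session timezone first, then convert to UTC.
in_utc = naive.replace(tzinfo=session_tz).astimezone(timezone.utc)

print(as_utc.isoformat())   # 2017-10-25T12:00:00+00:00
print(in_utc.isoformat())   # 2017-10-25T19:00:00+00:00
```

With pandas the equivalent repair would be `tz_localize(session_tz)` followed by `tz_convert('UTC')`, rather than treating the naive data as already UTC.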
[jira] [Created] (ARROW-1733) [C++] Utility for allocating fixed-size mutable primitive ArrayData with a single memory allocation
Wes McKinney created ARROW-1733: --- Summary: [C++] Utility for allocating fixed-size mutable primitive ArrayData with a single memory allocation Key: ARROW-1733 URL: https://issues.apache.org/jira/browse/ARROW-1733 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney The validity bitmap and the values for the primitive data would be part of a single allocation, giving better heap locality and possibly better performance in aggregate. The same approach is also being worked on for Java. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
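A sketch of what such a single-allocation layout could look like, assuming one validity bit per value and padding each region to Arrow's commonly recommended 64-byte buffer alignment (the actual C++ utility's scheme may differ):

```python
def padded(nbytes, alignment=64):
    # Round a byte count up to the next alignment boundary.
    return (nbytes + alignment - 1) // alignment * alignment

def combined_allocation_size(length, value_byte_width):
    # One allocation: [validity bitmap | padding | value buffer].
    bitmap_bytes = padded((length + 7) // 8)   # 1 validity bit per value
    value_bytes = padded(length * value_byte_width)
    return bitmap_bytes + value_bytes

# 1000 int64 values: 125 bitmap bytes padded to 128, plus 8000 value bytes.
print(combined_allocation_size(1000, 8))  # 8128
```

Keeping the bitmap and values adjacent means one heap allocation and one cache-friendly region instead of two independent buffers.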
[jira] [Commented] (ARROW-1710) [Java] Decide what to do with non-nullable vectors in new vector class hierarchy
[ https://issues.apache.org/jira/browse/ARROW-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219836#comment-16219836 ] Jacques Nadeau commented on ARROW-1710: --- I'm one of the voices strongly arguing for dropping the additional class objects. (I also was the one who originally introduced the two separate sets when the code was first developed.) My experience has been the following: * Extra complexity of managing two different runtime classes is very expensive (maintenance, coercing between, managing runtime code generation, etc) * Most source data is actually declared as nullable but rarely has nulls As such, having an adaptive interaction where you look at cells 64 values at a time and adapt your behavior based on actual nullability (as opposed to declared nullability) provides a much better performance lift in real world use cases than having specialized code for declared non-nullable situations. FYI: [~e.levine], the updated approach with vectors is moving to a situation where we don't have a bit vector and ultimately also consolidates the buffer for the bits and the fixed bytes in the same buffer. In that case, there is no heap memory overhead and the direct memory overhead is 1 bit per value, far less than necessary. Also note that in reality, most people focused on super high performance Java implementations interact directly with the memory. You can see an example of how we do this here: https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java#L89 If, in the future, if people need the vector classes to have an additional set of methods such as: allocateNewNoNull() setSafeIgnoreNull(int index, int value) let's just add those when someone's usecase requires it. No need to have an extra set of vectors for that purpose. 
> [Java] Decide what to do with non-nullable vectors in new vector class > hierarchy > - > > Key: ARROW-1710 > URL: https://issues.apache.org/jira/browse/ARROW-1710 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java - Vectors >Reporter: Li Jin > Fix For: 0.8.0 > > > So far the consensus seems to be remove all non-nullable vectors. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
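The word-at-a-time adaptation described in the comment above can be sketched in Python: scan the validity bitmap in 64-bit words and take a fast path with no per-value branch whenever a whole word shows no nulls, instead of compiling specialized code for declared non-nullable vectors. This is an illustration of the idea, not Arrow's Java implementation:

```python
ALL_SET = (1 << 64) - 1  # a 64-bit validity word with every bit set

def sum_with_validity(values, validity_words):
    total = 0
    for w, word in enumerate(validity_words):
        base = w * 64
        if word == ALL_SET:
            # Fast path: no nulls in this word, no per-value branching.
            total += sum(values[base:base + 64])
        else:
            # Slow path: consult the bitmap bit by bit.
            for bit in range(64):
                i = base + bit
                if i < len(values) and (word >> bit) & 1:
                    total += values[i]
    return total

values = list(range(128))
words = [ALL_SET, ALL_SET & ~1]        # second word: its first value is null
print(sum_with_validity(values, words))  # 8064 (8128 minus the null value 64)
```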
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219709#comment-16219709 ] ASF GitHub Bot commented on ARROW-1721: --- Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339501657 @wesm Thank you! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
[ https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette reassigned ARROW-1727: Assignee: Brian Hulette > [Format] Expand Arrow streaming format to permit new dictionaries and deltas > / additions to existing dictionaries > - > > Key: ARROW-1727 > URL: https://issues.apache.org/jira/browse/ARROW-1727 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney >Assignee: Brian Hulette > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1047) [Java] Add generalized stream writer and reader interfaces that are decoupled from IO / message framing
[ https://issues.apache.org/jira/browse/ARROW-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-1047: --- Assignee: Bryan Cutler > [Java] Add generalized stream writer and reader interfaces that are decoupled > from IO / message framing > --- > > Key: ARROW-1047 > URL: https://issues.apache.org/jira/browse/ARROW-1047 > Project: Apache Arrow > Issue Type: New Feature > Components: Java - Vectors >Reporter: Wes McKinney >Assignee: Bryan Cutler > > cc [~julienledem] [~elahrvivaz] [~nongli] > The ArrowWriter > https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java > accepts a WriteableByteChannel where the stream is written > It would be useful to be able to support other kinds of message framing and > transport, like GRPC or HTTP. So rather than writing a complete Arrow stream > as a single contiguous byte stream, the component messages (schema, > dictionaries, and record batches) would be framed as separate messages in the > underlying protocol. > So if we were using ProtocolBuffers and gRPC as the underlying transport for > the stream, we could encapsulate components of an Arrow stream in objects > like: > {code:language=protobuf} > message ArrowMessagePB { > required bytes serialized_data; > } > {code} > If the transport supports zero copy, that is obviously better than > serializing then parsing a protocol buffer. > We should do this work in C++ as well to support more flexible stream > transport. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
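The decoupling proposed above can be illustrated with a minimal length-prefixed framing over an in-memory stream: each component message (schema, dictionaries, record batches) travels as its own frame, so any transport that can carry discrete messages can carry the stream. The framing and message contents below are stand-ins, not the real Arrow IPC message format:

```python
import io
import struct

def write_message(sink, payload: bytes):
    # Frame each message with a 4-byte little-endian length prefix.
    sink.write(struct.pack("<I", len(payload)))
    sink.write(payload)

def read_messages(source):
    # Yield payloads until the stream is exhausted.
    while True:
        header = source.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("<I", header)
        yield source.read(length)

buf = io.BytesIO()
for msg in [b"schema", b"dictionary", b"record batch 0", b"record batch 1"]:
    write_message(buf, msg)

buf.seek(0)
print([m.decode() for m in read_messages(buf)])
```

A gRPC transport would replace the length prefix with its own message envelope (e.g. the `ArrowMessagePB` wrapper sketched in the issue), while the writer/reader logic stays unchanged.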
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219454#comment-16219454 ] ASF GitHub Bot commented on ARROW-1723: --- MaxRis commented on a change in pull request #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows URL: https://github.com/apache/arrow/pull/1244#discussion_r146974055 ## File path: cpp/cmake_modules/BuildUtils.cmake ## @@ -154,22 +161,28 @@ function(ADD_ARROW_LIB LIB_NAME) endif() if (ARROW_BUILD_STATIC) - if (MSVC) -set(LIB_NAME_STATIC ${LIB_NAME}_static) - else() -set(LIB_NAME_STATIC ${LIB_NAME}) - endif() - add_library(${LIB_NAME}_static STATIC $) +if (MSVC) + set(LIB_NAME_STATIC ${LIB_NAME}_static) +else() + set(LIB_NAME_STATIC ${LIB_NAME}) +endif() +add_library(${LIB_NAME}_static STATIC ${LIB_DEPS}) +if(EXTRA_DEPS) + add_dependencies(${LIB_NAME}_static ${EXTRA_DEPS}) +endif() + set_target_properties(${LIB_NAME}_static PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" OUTPUT_NAME ${LIB_NAME_STATIC}) - target_link_libraries(${LIB_NAME}_static +target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC) Review comment: @JohnPJenkins To avoid confusion, it might make sense to define this only for MSVC This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Windows: __declspec(dllexport) specified when building arrow static library > --- > > Key: ARROW-1723 > URL: https://issues.apache.org/jira/browse/ARROW-1723 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: John Jenkins > Labels: pull-request-available > > As I understand it, dllexport/dllimport should be left out when building and > using static libraries on Windows. A PR will follow shortly. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1660) [Python] pandas field values are messed up across rows
[ https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219432#comment-16219432 ] MIkhail Osckin commented on ARROW-1660: --- I definitely tested it with the latest pyarrow version at the moment. I had the same intuition that this issue might be related to splicing, because my initial dataset was ordered by id field and top of the dataset (after to_pandas) was something like this 10012, 10015, 10034, and the row with id like 10018 had values from 100034 and only part of them at least in one column (and if i remember well 10018 was the exact third id by ascendence. > [Python] pandas field values are messed up across rows > -- > > Key: ARROW-1660 > URL: https://issues.apache.org/jira/browse/ARROW-1660 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.7.1 > Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3 >Reporter: MIkhail Osckin >Assignee: Wes McKinney > > I have the following scala case class to store sparse matrix data to read it > later using python > {code:java} > case class CooVector( > id: Int, > row_ids: Seq[Int], > rowsIdx: Seq[Int], > colIdx: Seq[Int], > data: Seq[Double]) > {code} > I save the dataset of this type to multiple parquet files using spark and > then read it using pyarrow.parquet and convert the result to pandas dataset. > The problem i have is that some values end up in wrong rows, for example, > row_ids might end up in wrong cooVector row. I have no idea what the reason > is but might be it is related to the fact that the fields are of variable > sizes. And everything is correct if i read it using spark. Also i checked > to_pydict method and the result is correct, so seems like the problem > somewhere in to_pandas method. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1728) [C++] Run clang-format checks in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1728: -- Labels: pull-request-available (was: ) > [C++] Run clang-format checks in Travis CI > -- > > Key: ARROW-1728 > URL: https://issues.apache.org/jira/browse/ARROW-1728 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > I think it's reasonable to expect contributors to run clang-format on their > code. This may lead to a higher number of failed builds but will eliminate > noise diffs in unrelated patches -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1728) [C++] Run clang-format checks in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219426#comment-16219426 ] ASF GitHub Bot commented on ARROW-1728: --- wesm opened a new pull request #1251: ARROW-1728: [C++] Run clang-format checks in Travis CI URL: https://github.com/apache/arrow/pull/1251 I also deliberately checked in a single flake so I can confirm this is working properly This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Run clang-format checks in Travis CI > -- > > Key: ARROW-1728 > URL: https://issues.apache.org/jira/browse/ARROW-1728 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > I think it's reasonable to expect contributors to run clang-format on their > code. This may lead to a higher number of failed builds but will eliminate > noise diffs in unrelated patches -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1728) [C++] Run clang-format checks in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1728: --- Assignee: Wes McKinney > [C++] Run clang-format checks in Travis CI > -- > > Key: ARROW-1728 > URL: https://issues.apache.org/jira/browse/ARROW-1728 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.8.0 > > > I think it's reasonable to expect contributors to run clang-format on their > code. This may lead to a higher number of failed builds but will eliminate > noise diffs in unrelated patches -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1491) [C++] Add casting implementations from strings to numbers or boolean
[ https://issues.apache.org/jira/browse/ARROW-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219398#comment-16219398 ] Wes McKinney commented on ARROW-1491: - While this would be nice, it's not immediately urgent. Some help would be appreciated > [C++] Add casting implementations from strings to numbers or boolean > > > Key: ARROW-1491 > URL: https://issues.apache.org/jira/browse/ARROW-1491 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1491) [C++] Add casting implementations from strings to numbers or boolean
[ https://issues.apache.org/jira/browse/ARROW-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1491: Fix Version/s: (was: 0.8.0) 0.9.0 > [C++] Add casting implementations from strings to numbers or boolean > > > Key: ARROW-1491 > URL: https://issues.apache.org/jira/browse/ARROW-1491 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
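A sketch of the requested behavior, in Python rather than the C++ kernel the issue asks for: each non-null string must parse completely or the cast fails (the function name and error text are illustrative):

```python
def cast_strings_to_int(values):
    # Null entries pass through; anything else must be a full base-10 integer.
    out = []
    for v in values:
        if v is None:
            out.append(None)
            continue
        try:
            out.append(int(v, 10))
        except ValueError:
            raise ValueError("Failed to parse %r as int64" % (v,))
    return out

print(cast_strings_to_int(["1", "-20", None, "300"]))  # [1, -20, None, 300]
```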
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219395#comment-16219395 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339448319 Turns out I can push to your branch, so done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219390#comment-16219390 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339447985 Couple flake8 warnings: ``` +flake8 --count /home/travis/build/apache/arrow/python/pyarrow /home/travis/build/apache/arrow/python/pyarrow/tests/test_convert_pandas.py:22:1: F401 'unittest' imported but unused /home/travis/build/apache/arrow/python/pyarrow/tests/test_convert_pandas.py:1115:16: E231 missing whitespace after ',' ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write
[ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1675: -- Labels: pull-request-available (was: ) > [Python] Use RecordBatch.from_pandas in FeatherWriter.write > --- > > Key: ARROW-1675 > URL: https://issues.apache.org/jira/browse/ARROW-1675 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > In addition to making the implementation simpler, we will also benefit from > multithreaded conversions, so faster write speeds -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write
[ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219316#comment-16219316 ] ASF GitHub Bot commented on ARROW-1675: --- wesm opened a new pull request #1250: ARROW-1675: [Python] Use RecordBatch.from_pandas in Feather write path URL: https://github.com/apache/arrow/pull/1250 This also makes Feather writes more robust to columns having a mix of unicode and bytes (these gets coerced to binary) Also resolves ARROW-1672 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use RecordBatch.from_pandas in FeatherWriter.write > --- > > Key: ARROW-1675 > URL: https://issues.apache.org/jira/browse/ARROW-1675 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > In addition to making the implementation simpler, we will also benefit from > multithreaded conversions, so faster write speeds -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1672) [Python] Failure to write Feather bytes column
[ https://issues.apache.org/jira/browse/ARROW-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1672: --- Assignee: Wes McKinney > [Python] Failure to write Feather bytes column > -- > > Key: ARROW-1672 > URL: https://issues.apache.org/jira/browse/ARROW-1672 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.8.0 > > > See bug report in https://github.com/wesm/feather/issues/320 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1732) [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
Wes McKinney created ARROW-1732:
---
Summary: [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
Key: ARROW-1732
URL: https://issues.apache.org/jira/browse/ARROW-1732
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.8.0

I believe this should have well-defined behavior and not raise an error:

{code}
In [5]: pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)
---
ValueError  Traceback (most recent call last)
 in ()
> 1 pa.RecordBatch.from_pandas(pd.DataFrame({}), preserve_index=False)

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_pandas (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:39957)()
    586             df, schema, preserve_index, nthreads=nthreads
    587         )
--> 588         return cls.from_arrays(arrays, names, metadata)
    589
    590     @staticmethod

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays (/home/wesm/code/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:40130)()
    615
    616     if not number_of_arrays:
--> 617         raise ValueError('Record batch cannot contain no arrays (for now)')
    618
    619     num_rows = len(arrays[0])

ValueError: Record batch cannot contain no arrays (for now)
{code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
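The quoted traceback shows the failure originates in the zero-array guard inside {{from_arrays}}. A stdlib-only sketch of the mechanism (the function names and the index-column behavior here are assumptions made for illustration, not the pyarrow implementation): with {{preserve_index=True}} an index column would be appended, so even a zero-column frame carries one array, while {{preserve_index=False}} sends zero arrays into the guard.

```python
def record_batch_from_arrays(arrays, names):
    # Mirrors the guard quoted in the traceback above.
    if not len(arrays):
        raise ValueError('Record batch cannot contain no arrays (for now)')
    return dict(zip(names, arrays))

def record_batch_from_pandas(columns, preserve_index=True):
    # columns: dict of column name -> list of values (a stand-in for a
    # DataFrame). preserve_index=True appends the index as an extra array,
    # so the guard never fires; preserve_index=False passes zero arrays.
    arrays = list(columns.values())
    names = list(columns)
    if preserve_index:
        num_rows = len(next(iter(columns.values()), []))
        arrays.append(list(range(num_rows)))
        names.append('__index_level_0__')
    return record_batch_from_arrays(arrays, names)

try:
    record_batch_from_pandas({}, preserve_index=False)
except ValueError as exc:
    print(exc)  # Record batch cannot contain no arrays (for now)
```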
[jira] [Updated] (ARROW-1718) [Python] Creating a pyarrow.Array of date type from pandas causes error
[ https://issues.apache.org/jira/browse/ARROW-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-1718: Description: When calling {{Array.from_pandas}} with a pandas.Series of dates and specifying the desired pyarrow type, an error occurs. If the type is not specified then {{from_pandas}} will interpret the data as a timestamp type. {code} import pandas as pd import pyarrow as pa import datetime arr = pa.array([datetime.date(2017, 10, 23)]) c = pa.Column.from_array("d", arr) s = c.to_pandas() print(s) # 0 2017-10-23 # Name: d, dtype: datetime64[ns] result = pa.Array.from_pandas(s, type=pa.date32()) print(result) """ Traceback (most recent call last): File "", line 1, in File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221) File "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", line 28, in array_format values.append(value_format(x, 0)) File "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", line 49, in value_format return repr(x) File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535) File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368) ValueError: year is out of range """ {code} was: When calling {Array.from_pandas} with a pandas.Series of dates and specifying the desired pyarrow type, an error occurs. If the type is not specified then {from_pandas} will interpret the data as a timestamp type. 
{code} import pandas as pd import pyarrow as pa import datetime arr = pa.array([datetime.date(2017, 10, 23)]) c = pa.Column.from_array("d", arr) s = c.to_pandas() print(s) # 0 2017-10-23 # Name: d, dtype: datetime64[ns] result = pa.Array.from_pandas(s, type=pa.date32()) print(result) """ Traceback (most recent call last): File "", line 1, in File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221) File "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", line 28, in array_format values.append(value_format(x, 0)) File "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", line 49, in value_format return repr(x) File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535) File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368) ValueError: year is out of range """ {code} > [Python] Creating a pyarrow.Array of date type from pandas causes error > --- > > Key: ARROW-1718 > URL: https://issues.apache.org/jira/browse/ARROW-1718 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Assignee: Wes McKinney > Fix For: 0.8.0 > > > When calling {{Array.from_pandas}} with a pandas.Series of dates and > specifying the desired pyarrow type, an error occurs. If the type is not > specified then {{from_pandas}} will interpret the data as a timestamp type. 
> {code} > import pandas as pd > import pyarrow as pa > import datetime > arr = pa.array([datetime.date(2017, 10, 23)]) > c = pa.Column.from_array("d", arr) > s = c.to_pandas() > print(s) > # 0 2017-10-23 > # Name: d, dtype: datetime64[ns] > result = pa.Array.from_pandas(s, type=pa.date32()) > print(result) > """ > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ > (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221) > File > "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", > line 28, in array_format > values.append(value_format(x, 0)) > File > "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", > line 49, in value_format > return repr(x) > File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ > (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535) > File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py > (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368) > ValueError: year is out o
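The {{ValueError: year is out of range}} is consistent with a unit mismatch: pandas hands over nanoseconds since the epoch ({{datetime64[ns]}}), while date32 stores days since the epoch. A stdlib sketch of that mismatch follows; note the cause is inferred from the traceback, not confirmed by the issue.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
d = datetime.date(2017, 10, 23)

# date32 semantics: days since the UNIX epoch
days = (d - EPOCH).days
assert EPOCH + datetime.timedelta(days=days) == d

# datetime64[ns] semantics: nanoseconds since the epoch
ns = days * 86400 * 10**9

# Reinterpreting the nanosecond payload as a day count lands far
# outside any representable year, matching the observed failure mode.
try:
    EPOCH + datetime.timedelta(days=ns)
except OverflowError as exc:
    print('out of range:', exc)
```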
[jira] [Created] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas
Wes McKinney created ARROW-1731: --- Summary: [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas Key: ARROW-1731 URL: https://issues.apache.org/jira/browse/ARROW-1731 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Currently it's all-or-nothing, and to do the subsetting in pandas incurs a data copy. This would enable columns (by name or index) to be selected out without additional data copying cc [~cpcloud] [~jreback] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
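The copy-free selection the issue proposes falls naturally out of a columnar layout: a subset can simply hold references to the existing column buffers. A minimal Python illustration of that idea (not the pyarrow API; the helper name is made up):

```python
# A table as a mapping of column name -> column values.
table = {'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0], 'c': ['x', 'y', 'z']}

def select_columns(table, names):
    # Keep references to the chosen columns only; no element-level copies.
    return {name: table[name] for name in names}

subset = select_columns(table, ['a', 'c'])
assert subset['a'] is table['a']  # same underlying list, not a copy
print(list(subset))               # ['a', 'c']
```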
[jira] [Assigned] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write
[ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1675: --- Assignee: Wes McKinney > [Python] Use RecordBatch.from_pandas in FeatherWriter.write > --- > > Key: ARROW-1675 > URL: https://issues.apache.org/jira/browse/ARROW-1675 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.8.0 > > > In addition to making the implementation simpler, we will also benefit from > multithreaded conversions, so faster write speeds -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1718) [Python] Creating a pyarrow.Array of date type from pandas causes error
[ https://issues.apache.org/jira/browse/ARROW-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1718: --- Assignee: Wes McKinney > [Python] Creating a pyarrow.Array of date type from pandas causes error > --- > > Key: ARROW-1718 > URL: https://issues.apache.org/jira/browse/ARROW-1718 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Assignee: Wes McKinney > Fix For: 0.8.0 > > > When calling {Array.from_pandas} with a pandas.Series of dates and specifying > the desired pyarrow type, an error occurs. If the type is not specified then > {from_pandas} will interpret the data as a timestamp type. > {code} > import pandas as pd > import pyarrow as pa > import datetime > arr = pa.array([datetime.date(2017, 10, 23)]) > c = pa.Column.from_array("d", arr) > s = c.to_pandas() > print(s) > # 0 2017-10-23 > # Name: d, dtype: datetime64[ns] > result = pa.Array.from_pandas(s, type=pa.date32()) > print(result) > """ > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ > (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221) > File > "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", > line 28, in array_format > values.append(value_format(x, 0)) > File > "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", > line 49, in value_format > return repr(x) > File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ > (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535) > File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py > (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368) > ValueError: year is out of range > """ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1730) [Python] Incorrect result from pyarrow.array when passing timestamp type
[ https://issues.apache.org/jira/browse/ARROW-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219186#comment-16219186 ] Wes McKinney commented on ARROW-1730: - But {code} In [15]: pa.array(np.array([0], dtype='int64'), type=pa.timestamp('ns')) Out[15]: [ Timestamp('1970-01-01 00:00:00') ] {code} > [Python] Incorrect result from pyarrow.array when passing timestamp type > > > Key: ARROW-1730 > URL: https://issues.apache.org/jira/browse/ARROW-1730 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney > Fix For: 0.8.0 > > > Even with the ARROW-1484 patch, we have: > {code: language=python} > In [10]: pa.array([0], type=pa.timestamp('ns')) > Out[10]: > > [ > Timestamp('1968-01-12 11:18:14.409378304') > ] > In [11]: pa.array([0], type='int64').cast(pa.timestamp('ns')) > Out[11]: > > [ > Timestamp('1970-01-01 00:00:00') > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1730) [Python] Incorrect result from pyarrow.array when passing timestamp type
Wes McKinney created ARROW-1730:
---
Summary: [Python] Incorrect result from pyarrow.array when passing timestamp type
Key: ARROW-1730
URL: https://issues.apache.org/jira/browse/ARROW-1730
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.8.0

Even with the ARROW-1484 patch, we have:

{code:language=python}
In [10]: pa.array([0], type=pa.timestamp('ns'))
Out[10]:
[
Timestamp('1968-01-12 11:18:14.409378304')
]

In [11]: pa.array([0], type='int64').cast(pa.timestamp('ns'))
Out[11]:
[
Timestamp('1970-01-01 00:00:00')
]
{code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
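Whatever the underlying bug, the expected semantics are clear from the {{cast}} result: an epoch offset of 0 denotes 1970-01-01 in every unit. A stdlib reference sketch of the conversion (helper name and unit table are assumptions for illustration):

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# microseconds per tick of each unit (timedelta's finest resolution)
US_PER_TICK = {'s': 10**6, 'ms': 10**3, 'us': 1}

def timestamp_to_datetime(value, unit):
    return EPOCH + datetime.timedelta(microseconds=value * US_PER_TICK[unit])

# 0 must map to the epoch regardless of unit
for unit in US_PER_TICK:
    assert timestamp_to_datetime(0, unit) == EPOCH

print(timestamp_to_datetime(1, 's'))  # 1970-01-01 00:00:01
```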
[jira] [Updated] (ARROW-1660) [Python] pandas field values are messed up across rows
[ https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1660: Fix Version/s: (was: 0.8.0) > [Python] pandas field values are messed up across rows > -- > > Key: ARROW-1660 > URL: https://issues.apache.org/jira/browse/ARROW-1660 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.7.1 > Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3 >Reporter: MIkhail Osckin >Assignee: Wes McKinney > > I have the following scala case class to store sparse matrix data to read it > later using python > {code:java} > case class CooVector( > id: Int, > row_ids: Seq[Int], > rowsIdx: Seq[Int], > colIdx: Seq[Int], > data: Seq[Double]) > {code} > I save the dataset of this type to multiple parquet files using spark and > then read it using pyarrow.parquet and convert the result to pandas dataset. > The problem i have is that some values end up in wrong rows, for example, > row_ids might end up in wrong cooVector row. I have no idea what the reason > is but might be it is related to the fact that the fields are of variable > sizes. And everything is correct if i read it using spark. Also i checked > to_pydict method and the result is correct, so seems like the problem > somewhere in to_pandas method. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1660) [Python] pandas field values are messed up across rows
[ https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219171#comment-16219171 ] Wes McKinney commented on ARROW-1660: - Is it possible you were using pyarrow < 0.7.0? There was a bug ARROW-1357 that was fixed that would cause the issue you were seeing. I'm a bit at a loss since the relevant test case is https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_convert_pandas.py#L600. I will move off the 0.8.0 milestone, but leave the issue open in case you can find a repro > [Python] pandas field values are messed up across rows > -- > > Key: ARROW-1660 > URL: https://issues.apache.org/jira/browse/ARROW-1660 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.7.1 > Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3 >Reporter: MIkhail Osckin >Assignee: Wes McKinney > > I have the following scala case class to store sparse matrix data to read it > later using python > {code:java} > case class CooVector( > id: Int, > row_ids: Seq[Int], > rowsIdx: Seq[Int], > colIdx: Seq[Int], > data: Seq[Double]) > {code} > I save the dataset of this type to multiple parquet files using spark and > then read it using pyarrow.parquet and convert the result to pandas dataset. > The problem i have is that some values end up in wrong rows, for example, > row_ids might end up in wrong cooVector row. I have no idea what the reason > is but might be it is related to the fact that the fields are of variable > sizes. And everything is correct if i read it using spark. Also i checked > to_pydict method and the result is correct, so seems like the problem > somewhere in to_pandas method. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI
[ https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219127#comment-16219127 ] ASF GitHub Bot commented on ARROW-1455: --- wesm commented on issue #1249: ARROW-1455 [Python] Add Dockerfile for validating Dask integration URL: https://github.com/apache/arrow/pull/1249#issuecomment-339405716 We should not check data files in to the git repo, so we will need to handle test data in some other way. We will also want to collect the Python-related integration tests someplace Python-specific. I will review in more detail when I can This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add Dockerfile for validating Dask integration outside of usual CI > --- > > Key: ARROW-1455 > URL: https://issues.apache.org/jira/browse/ARROW-1455 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson > Labels: pull-request-available > > Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the > moment, but we can add a testing set up in > https://github.com/apache/arrow/tree/master/python/testing so that this can > be validated on an ad hoc basis in a reproducible way. > see also ARROW-1417 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI
[ https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1455: -- Labels: pull-request-available (was: ) > [Python] Add Dockerfile for validating Dask integration outside of usual CI > --- > > Key: ARROW-1455 > URL: https://issues.apache.org/jira/browse/ARROW-1455 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson > Labels: pull-request-available > > Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the > moment, but we can add a testing set up in > https://github.com/apache/arrow/tree/master/python/testing so that this can > be validated on an ad hoc basis in a reproducible way. > see also ARROW-1417 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1455) [Python] Add Dockerfile for validating Dask integration outside of usual CI
[ https://issues.apache.org/jira/browse/ARROW-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219122#comment-16219122 ] ASF GitHub Bot commented on ARROW-1455: --- heimir-sverrisson opened a new pull request #1249: ARROW-1455 [Python] Add Dockerfile for validating Dask integration URL: https://github.com/apache/arrow/pull/1249 A Docker container is created with all the dependencies needed to pull down the Dask code from Github and install it locally, together with Arrow, to run an integration test. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add Dockerfile for validating Dask integration outside of usual CI > --- > > Key: ARROW-1455 > URL: https://issues.apache.org/jira/browse/ARROW-1455 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson > Labels: pull-request-available > > Introducing the Dask stack into Arrow's CI might be a bit heavyweight at the > moment, but we can add a testing set up in > https://github.com/apache/arrow/tree/master/python/testing so that this can > be validated on an ad hoc basis in a reproducible way. > see also ARROW-1417 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units
[ https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219123#comment-16219123 ] ASF GitHub Bot commented on ARROW-1484: --- wesm commented on a change in pull request #1245: ARROW-1484: [C++/Python] Implement casts between date, time, timestamp units URL: https://github.com/apache/arrow/pull/1245#discussion_r146927485 ## File path: cpp/src/arrow/compute/compute-test.cc ## @@ -270,6 +275,205 @@ TEST_F(TestCast, ToIntDowncastUnsafe) { options); } +TEST_F(TestCast, TimestampToTimestamp) { + CastOptions options; + + auto CheckTimestampCast = [this]( + const CastOptions& options, TimeUnit::type from_unit, TimeUnit::type to_unit, + const std::vector& from_values, const std::vector& to_values, + const std::vector& is_valid) { +CheckCase( +timestamp(from_unit), from_values, is_valid, timestamp(to_unit), to_values, +options); + }; + + vector is_valid = {true, false, true, true, true}; + + // Multiply promotions + vector v1 = {0, 100, 200, 1, 2}; + vector e1 = {0, 10, 20, 1000, 2000}; + CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MILLI, v1, e1, is_valid); + + vector v2 = {0, 100, 200, 1, 2}; + vector e2 = {0, 1L, 2L, 100, 200}; + CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MICRO, v2, e2, is_valid); + + vector v3 = {0, 100, 200, 1, 2}; + vector e3 = {0, 1000L, 2000L, 10L, 20L}; + CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::NANO, v3, e3, is_valid); + + vector v4 = {0, 100, 200, 1, 2}; + vector e4 = {0, 10, 20, 1000, 2000}; + CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::MICRO, v4, e4, is_valid); + + vector v5 = {0, 100, 200, 1, 2}; + vector e5 = {0, 1L, 2L, 100, 200}; + CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::NANO, v5, e5, is_valid); + + vector v6 = {0, 100, 200, 1, 2}; + vector e6 = {0, 10, 20, 1000, 2000}; + CheckTimestampCast(options, TimeUnit::MICRO, TimeUnit::NANO, v6, e6, is_valid); + + // Zero copy + std::shared_ptr arr; + 
vector v7 = {0, 7, 2000, 1000, 0}; + ArrayFromVector(timestamp(TimeUnit::SECOND), is_valid, v7, + &arr); + CheckZeroCopy(*arr, timestamp(TimeUnit::SECOND)); + + // Divide, truncate + vector v8 = {0, 100123, 200456, 1123, 2456}; + vector e8 = {0, 100, 200, 1, 2}; + + options.allow_time_truncate = true; Review comment: Thanks for catching. I'll make `safe=True` set this option http://arrow.apache.org/docs/python/generated/pyarrow.lib.Array.html#pyarrow.lib.Array.cast This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Implement (safe and unsafe) casts between timestamps and times of > different units > --- > > Key: ARROW-1484 > URL: https://issues.apache.org/jira/browse/ARROW-1484 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1484) [C++] Implement (safe and unsafe) casts between timestamps and times of different units
[ https://issues.apache.org/jira/browse/ARROW-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219116#comment-16219116 ] ASF GitHub Bot commented on ARROW-1484: --- BryanCutler commented on a change in pull request #1245: ARROW-1484: [C++/Python] Implement casts between date, time, timestamp units URL: https://github.com/apache/arrow/pull/1245#discussion_r146926826 ## File path: cpp/src/arrow/compute/compute-test.cc ## @@ -270,6 +275,205 @@ TEST_F(TestCast, ToIntDowncastUnsafe) { options); } +TEST_F(TestCast, TimestampToTimestamp) { + CastOptions options; + + auto CheckTimestampCast = [this]( + const CastOptions& options, TimeUnit::type from_unit, TimeUnit::type to_unit, + const std::vector& from_values, const std::vector& to_values, + const std::vector& is_valid) { +CheckCase( +timestamp(from_unit), from_values, is_valid, timestamp(to_unit), to_values, +options); + }; + + vector is_valid = {true, false, true, true, true}; + + // Multiply promotions + vector v1 = {0, 100, 200, 1, 2}; + vector e1 = {0, 10, 20, 1000, 2000}; + CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MILLI, v1, e1, is_valid); + + vector v2 = {0, 100, 200, 1, 2}; + vector e2 = {0, 1L, 2L, 100, 200}; + CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::MICRO, v2, e2, is_valid); + + vector v3 = {0, 100, 200, 1, 2}; + vector e3 = {0, 1000L, 2000L, 10L, 20L}; + CheckTimestampCast(options, TimeUnit::SECOND, TimeUnit::NANO, v3, e3, is_valid); + + vector v4 = {0, 100, 200, 1, 2}; + vector e4 = {0, 10, 20, 1000, 2000}; + CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::MICRO, v4, e4, is_valid); + + vector v5 = {0, 100, 200, 1, 2}; + vector e5 = {0, 1L, 2L, 100, 200}; + CheckTimestampCast(options, TimeUnit::MILLI, TimeUnit::NANO, v5, e5, is_valid); + + vector v6 = {0, 100, 200, 1, 2}; + vector e6 = {0, 10, 20, 1000, 2000}; + CheckTimestampCast(options, TimeUnit::MICRO, TimeUnit::NANO, v6, e6, is_valid); + + // Zero copy + std::shared_ptr 
arr; + vector v7 = {0, 7, 2000, 1000, 0}; + ArrayFromVector(timestamp(TimeUnit::SECOND), is_valid, v7, + &arr); + CheckZeroCopy(*arr, timestamp(TimeUnit::SECOND)); + + // Divide, truncate + vector v8 = {0, 100123, 200456, 1123, 2456}; + vector e8 = {0, 100, 200, 1, 2}; + + options.allow_time_truncate = true; Review comment: Does this option need to be set in pyarrow to prevent an error when truncating? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Implement (safe and unsafe) casts between timestamps and times of > different units > --- > > Key: ARROW-1484 > URL: https://issues.apache.org/jira/browse/ARROW-1484 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
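The review exchange above centers on {{CastOptions.allow_time_truncate}}: upcasts between units multiply, downcasts divide, and a safe cast must refuse a division that would discard sub-unit precision. A Python sketch of that rule (an illustration of the semantics discussed, not the Arrow C++ implementation):

```python
def cast_timestamps(values, ticks_per_sec_from, ticks_per_sec_to,
                    allow_time_truncate=False):
    if ticks_per_sec_to >= ticks_per_sec_from:
        # multiply promotion, e.g. seconds -> milliseconds
        factor = ticks_per_sec_to // ticks_per_sec_from
        return [v * factor for v in values]
    # divide; truncation loses precision unless explicitly allowed
    factor = ticks_per_sec_from // ticks_per_sec_to
    if not allow_time_truncate and any(v % factor for v in values):
        raise ValueError('cast would truncate sub-unit precision')
    return [v // factor for v in values]

print(cast_timestamps([0, 1, 2], 1, 1000))        # seconds -> ms: [0, 1000, 2000]
print(cast_timestamps([1123, 2456], 1000, 1,
                      allow_time_truncate=True))  # ms -> s, truncated: [1, 2]
```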
[jira] [Created] (ARROW-1729) [C++] Upgrade clang bits to 5.0 once promoted to stable
Wes McKinney created ARROW-1729: --- Summary: [C++] Upgrade clang bits to 5.0 once promoted to stable Key: ARROW-1729 URL: https://issues.apache.org/jira/browse/ARROW-1729 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney This includes our CI toolchain and pinned clang-format version. According to http://apt.llvm.org/ 5.0 is still the "qualification branch" where 4.0 is stable -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219091#comment-16219091 ] ASF GitHub Bot commented on ARROW-1721: --- Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339400591 @wesm Now fixed! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219090#comment-16219090 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339400213 Thanks! According to llvm.org, clang-5.0 is still the qualification branch (http://apt.llvm.org/) so whenever 5.0 is promoted to stable we'll upgrade our clang bits to 5.0 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
[ https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1727: Fix Version/s: 0.8.0 > [Format] Expand Arrow streaming format to permit new dictionaries and deltas > / additions to existing dictionaries > - > > Key: ARROW-1727 > URL: https://issues.apache.org/jira/browse/ARROW-1727 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
[ https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219088#comment-16219088 ] Wes McKinney commented on ARROW-1727: - Yes, documentation and adding to the Flatbuffers schemas. Flatbuffers supports default values, so we could make the default NEW https://github.com/apache/arrow/blob/master/format/Schema.fbs#L132 > [Format] Expand Arrow streaming format to permit new dictionaries and deltas > / additions to existing dictionaries > - > > Key: ARROW-1727 > URL: https://issues.apache.org/jira/browse/ARROW-1727 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
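The proposal distinguishes a brand-new dictionary from a delta appended to an existing one, with NEW as the Flatbuffers default. A Python sketch of how a stream reader might apply the two kinds (the flag name here is an assumption for illustration, not the final schema field):

```python
def apply_dictionary_batch(current, values, is_delta=False):
    # is_delta=False (the proposed default, "NEW"): replace the dictionary.
    # is_delta=True ("DELTA"): append new entries to the existing one.
    if is_delta:
        return current + list(values)
    return list(values)

d = apply_dictionary_batch([], ['a', 'b'])           # NEW
d = apply_dictionary_batch(d, ['c'], is_delta=True)  # DELTA
print(d)  # ['a', 'b', 'c']
```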
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219085#comment-16219085 ] ASF GitHub Bot commented on ARROW-1721: --- Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339399427 @wesm Sorry! I'm using clang-format-5! I'll fix! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219080#comment-16219080 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339398833 I ran clang-format 4.0 locally and got this diff https://github.com/wesm/arrow/commit/7547ac8e70b5279e44fe802bdbd241ad9a8f0d4a This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1660) pandas field values are messed up across rows
[ https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219063#comment-16219063 ] Wes McKinney commented on ARROW-1660: - I think it might be related to splicing together files. I'll write some tests and then close this issue; if you are able to reproduce it in the future, please let us know > pandas field values are messed up across rows > - > > Key: ARROW-1660 > URL: https://issues.apache.org/jira/browse/ARROW-1660 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.7.1 > Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3 >Reporter: MIkhail Osckin > Fix For: 0.8.0 > > > I have the following Scala case class to store sparse matrix data to read it > later using Python > {code:java} > case class CooVector( > id: Int, > row_ids: Seq[Int], > rowsIdx: Seq[Int], > colIdx: Seq[Int], > data: Seq[Double]) > {code} > I save the dataset of this type to multiple parquet files using Spark and > then read it using pyarrow.parquet and convert the result to a pandas dataset. > The problem I have is that some values end up in the wrong rows; for example, > row_ids might end up in the wrong CooVector row. I have no idea what the reason > is, but it might be related to the fact that the fields are of variable > sizes. Everything is correct if I read it using Spark. I also checked the > to_pydict method and the result is correct, so it seems like the problem is > somewhere in the to_pandas method. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1660) [Python] pandas field values are messed up across rows
[ https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1660: --- Assignee: Wes McKinney > [Python] pandas field values are messed up across rows > -- > > Key: ARROW-1660 > URL: https://issues.apache.org/jira/browse/ARROW-1660 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.7.1 > Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3 >Reporter: MIkhail Osckin >Assignee: Wes McKinney > Fix For: 0.8.0 > > > I have the following Scala case class to store sparse matrix data to read it > later using Python > {code:java} > case class CooVector( > id: Int, > row_ids: Seq[Int], > rowsIdx: Seq[Int], > colIdx: Seq[Int], > data: Seq[Double]) > {code} > I save the dataset of this type to multiple parquet files using Spark and > then read it using pyarrow.parquet and convert the result to a pandas dataset. > The problem I have is that some values end up in the wrong rows; for example, > row_ids might end up in the wrong CooVector row. I have no idea what the reason > is, but it might be related to the fact that the fields are of variable > sizes. Everything is correct if I read it using Spark. I also checked the > to_pydict method and the result is correct, so it seems like the problem is > somewhere in the to_pandas method. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1660) [Python] pandas field values are messed up across rows
[ https://issues.apache.org/jira/browse/ARROW-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1660: Summary: [Python] pandas field values are messed up across rows (was: pandas field values are messed up across rows) > [Python] pandas field values are messed up across rows > -- > > Key: ARROW-1660 > URL: https://issues.apache.org/jira/browse/ARROW-1660 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.7.1 > Environment: 4.4.0-72-generic #93-Ubuntu SMP x86_64, python3 >Reporter: MIkhail Osckin > Fix For: 0.8.0 > > > I have the following Scala case class to store sparse matrix data to read it > later using Python > {code:java} > case class CooVector( > id: Int, > row_ids: Seq[Int], > rowsIdx: Seq[Int], > colIdx: Seq[Int], > data: Seq[Double]) > {code} > I save the dataset of this type to multiple parquet files using Spark and > then read it using pyarrow.parquet and convert the result to a pandas dataset. > The problem I have is that some values end up in the wrong rows; for example, > row_ids might end up in the wrong CooVector row. I have no idea what the reason > is, but it might be related to the fact that the fields are of variable > sizes. Everything is correct if I read it using Spark. I also checked the > to_pydict method and the result is correct, so it seems like the problem is > somewhere in the to_pandas method. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (ARROW-1367) [Website] Divide CHANGELOG issues by component and add subheaders
[ https://issues.apache.org/jira/browse/ARROW-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-1367. --- Resolution: Won't Fix Since some issues may be in multiple components, and some not at all, this is a bit complex to generate, for unclear benefit. Users can always browse the fix versions by component on JIRA > [Website] Divide CHANGELOG issues by component and add subheaders > - > > Key: ARROW-1367 > URL: https://issues.apache.org/jira/browse/ARROW-1367 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.8.0 > > > This will make the changelog on the website more readable. JIRAs may appear > in more than one component listing. We should practice good JIRA hygiene by > associating all JIRAs with at least one component. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219051#comment-16219051 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339395625 I'm surprised by some of the formatting changes, are you using clang-format-4.0? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
[ https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219039#comment-16219039 ] Brian Hulette commented on ARROW-1727: -- Is the scope of this ticket just making the appropriate documentation and/or flatbuffer spec changes in [/format|https://github.com/apache/arrow/tree/master/format]? I like the idea of including a {{NEW/DELTA}} flag in the dictionary batch. Is there a way the flag could be optional and default to {{NEW}} for backwards compatibility? Or is that not worth the trouble? > [Format] Expand Arrow streaming format to permit new dictionaries and deltas > / additions to existing dictionaries > - > > Key: ARROW-1727 > URL: https://issues.apache.org/jira/browse/ARROW-1727 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219033#comment-16219033 ] ASF GitHub Bot commented on ARROW-1721: --- Licht-T commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339393634 @wesm Fixed the whole lint issues! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-587) Add JIRA fix version to merge tool
[ https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219021#comment-16219021 ] ASF GitHub Bot commented on ARROW-587: -- wesm opened a new pull request #1248: ARROW-587: Add fix version to PR merge tool URL: https://github.com/apache/arrow/pull/1248 This was ported from parquet-mr/parquet-cpp. We should merge a separate patch with this branch before committing this This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add JIRA fix version to merge tool > -- > > Key: ARROW-587 > URL: https://issues.apache.org/jira/browse/ARROW-587 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > Like parquet-mr's tool. This will make releases less painful -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-587) Add JIRA fix version to merge tool
[ https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-587: - Labels: pull-request-available (was: ) > Add JIRA fix version to merge tool > -- > > Key: ARROW-587 > URL: https://issues.apache.org/jira/browse/ARROW-587 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > Like parquet-mr's tool. This will make releases less painful -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1367) [Website] Divide CHANGELOG issues by component and add subheaders
[ https://issues.apache.org/jira/browse/ARROW-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1367: --- Assignee: Wes McKinney > [Website] Divide CHANGELOG issues by component and add subheaders > - > > Key: ARROW-1367 > URL: https://issues.apache.org/jira/browse/ARROW-1367 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.8.0 > > > This will make the changelog on the website more readable. JIRAs may appear > in more than one component listing. We should practice good JIRA hygiene by > associating all JIRAs with at least one component. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-587) Add JIRA fix version to merge tool
[ https://issues.apache.org/jira/browse/ARROW-587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-587: -- Assignee: Wes McKinney > Add JIRA fix version to merge tool > -- > > Key: ARROW-587 > URL: https://issues.apache.org/jira/browse/ARROW-587 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.8.0 > > > Like parquet-mr's tool. This will make releases less painful -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1721: --- Assignee: Wes McKinney > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218996#comment-16218996 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on issue #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#issuecomment-339387796 Thank you for doing this! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1721: --- Assignee: Licht Takeuchi (was: Wes McKinney) > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Licht Takeuchi > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218993#comment-16218993 ] ASF GitHub Bot commented on ARROW-1723: --- JohnPJenkins commented on issue #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows URL: https://github.com/apache/arrow/pull/1244#issuecomment-339387674 Reworked the commit based on discussion - Windows builds now use separate compilation with a conditional ARROW_STATIC macro for static and shared library targets (Unix remains the same). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Windows: __declspec(dllexport) specified when building arrow static library > --- > > Key: ARROW-1723 > URL: https://issues.apache.org/jira/browse/ARROW-1723 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: John Jenkins > Labels: pull-request-available > > As I understand it, dllexport/dllimport should be left out when building and > using static libraries on Windows. A PR will follow shortly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
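[Editorial note] The fix described in ARROW-1723 — leaving `__declspec(dllexport)`/`__declspec(dllimport)` out when building or consuming a static library on Windows — is commonly implemented with a conditional export macro. The sketch below uses hypothetical macro names (`MYLIB_STATIC`, `MYLIB_EXPORTING`, `MYLIB_EXPORT`) rather than Arrow's actual ones:

```cpp
// Hypothetical visibility macro, mirroring the usual Windows pattern:
// MYLIB_STATIC is defined when building or consuming the static library,
// MYLIB_EXPORTING when compiling the shared library itself.
#if defined(MYLIB_STATIC) || !defined(_WIN32)
#  define MYLIB_EXPORT  // static build or non-Windows: no decoration
#elif defined(MYLIB_EXPORTING)
#  define MYLIB_EXPORT __declspec(dllexport)  // building the DLL
#else
#  define MYLIB_EXPORT __declspec(dllimport)  // linking against the DLL
#endif

// Every public symbol is declared with the macro; in a static build it
// expands to nothing, avoiding the spurious dllexport/dllimport that the
// issue reports.
MYLIB_EXPORT int Add(int a, int b) { return a + b; }
```

On the build-system side this corresponds to the "separate compilation per target" approach mentioned in the PR: the static library target gets `MYLIB_STATIC` in its compile definitions, while the shared target gets `MYLIB_EXPORTING`.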
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218991#comment-16218991 ] ASF GitHub Bot commented on ARROW-1721: --- wesm commented on a change in pull request #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc URL: https://github.com/apache/arrow/pull/1246#discussion_r146910978 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -1029,6 +1033,44 @@ Status LoopPySequence(PyObject* sequence, T func) { return Status::OK(); } +template <typename T> +Status LoopPySequenceWithMasks( +PyObject* sequence, +const Ndarray1DIndexer<uint8_t>& mask_values, +bool have_mask, +T func +) { Review comment: Can you run clang-format? (`make format` or `ninja format`). This should also fix the cpplint failure in CI. See https://github.com/apache/arrow/tree/master/cpp#continuous-integration This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Support null mask in places where it isn't supported in > numpy_to_arrow.cc > -- > > Key: ARROW-1721 > URL: https://issues.apache.org/jira/browse/ARROW-1721 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > see https://github.com/apache/spark/pull/18664#discussion_r146472109 for > SPARK-21375 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1728) [C++] Run clang-format checks in Travis CI
Wes McKinney created ARROW-1728: --- Summary: [C++] Run clang-format checks in Travis CI Key: ARROW-1728 URL: https://issues.apache.org/jira/browse/ARROW-1728 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.8.0 I think it's reasonable to expect contributors to run clang-format on their code. This may lead to a higher number of failed builds but will eliminate noise diffs in unrelated patches -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1726) [GLib] Add setup description to verify C GLib build
[ https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218978#comment-16218978 ] ASF GitHub Bot commented on ARROW-1726: --- wesm closed pull request #1247: ARROW-1726: [GLib] Add setup description to verify C GLib build URL: https://github.com/apache/arrow/pull/1247 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/dev/release/VERIFY.md b/dev/release/VERIFY.md index 3f073e408..5b441ac13 100644 --- a/dev/release/VERIFY.md +++ b/dev/release/VERIFY.md @@ -22,4 +22,55 @@ ## Windows We've provided a convenience script for verifying the C++ and Python builds on -Windows. Read the comments in `verify-release-candidate.bat` for instructions \ No newline at end of file +Windows. Read the comments in `verify-release-candidate.bat` for instructions. + +## Linux and macOS + +We've provided a convenience script for verifying the C++, Python, C +GLib, Java and JavaScript builds on Linux and macOS. Read the comments in +`verify-release-candidate.sh` for instructions. 
+ +### C GLib + +You need the followings to verify C GLib build: + + * GLib + * GObject Introspection + * Ruby (not EOL-ed version is required) + * gobject-introspection gem + * test-unit gem + +You can install them by the followings on Debian GNU/Linux and Ubuntu: + +```console +% sudo apt install -y -V libgirepository1.0-dev ruby-dev +% sudo gem install gobject-introspection test-unit +``` + +You can install them by the followings on CentOS: + +```console +% sudo yum install -y gobject-introspection-devel +% git clone https://github.com/sstephenson/rbenv.git ~/.rbenv +% git clone https://github.com/sstephenson/ruby-build.git ~/.rbenv/plugins/ruby-build +% echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bash_profile +% echo 'eval "$(rbenv init -)"' >> ~/.bash_profile +% exec ${SHELL} --login +% sudo yum install -y gcc make patch openssl-devel readline-devel zlib-devel +% rbenv install 2.4.2 +% rbenv global 2.4.2 +% gem install gobject-introspection test-unit +``` + +You can install them by the followings on macOS: + +```console +% brew install -y gobject-introspection +% gem install gobject-introspection test-unit +``` + +You need to set `PKG_CONFIG_PATH` to find libffi on macOS: + +```console +% export PKG_CONFIG_PATH=$(brew --prefix libffi)/lib/pkgconfig:$PKG_CONFIG_PATH +``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [GLib] Add setup description to verify C GLib build > --- > > Key: ARROW-1726 > URL: https://issues.apache.org/jira/browse/ARROW-1726 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (ARROW-1726) [GLib] Add setup description to verify C GLib build
[ https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1726. - Resolution: Fixed Issue resolved by pull request 1247 [https://github.com/apache/arrow/pull/1247] > [GLib] Add setup description to verify C GLib build > --- > > Key: ARROW-1726 > URL: https://issues.apache.org/jira/browse/ARROW-1726 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file
[ https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218927#comment-16218927 ] ASF GitHub Bot commented on ARROW-473: -- AnkitAggarwalPEC commented on issue #1031: WIP ARROW-473: [C++/Python] Add public API for retrieving block locations for a particular HDFS file URL: https://github.com/apache/arrow/pull/1031#issuecomment-339375539 @cpcloud Is there any environment that is needed to set before this? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++/Python] Add public API for retrieving block locations for a particular > HDFS file > - > > Key: ARROW-473 > URL: https://issues.apache.org/jira/browse/ARROW-473 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > This is necessary for applications looking to schedule data-local work. > libhdfs does not have APIs to request the block locations directly, so we > need to see if the {{hdfsGetHosts}} function will do what we need. For > libhdfs3 there is a public API function -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file
[ https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218923#comment-16218923 ] ASF GitHub Bot commented on ARROW-473: -- AnkitAggarwalPEC commented on issue #1031: WIP ARROW-473: [C++/Python] Add public API for retrieving block locations for a particular HDFS file URL: https://github.com/apache/arrow/pull/1031#issuecomment-339375220 @cpcloud I'm running the script for last 10 minutes But it is still showing the same error Could not execute command: select VERSION() Starting Impala Shell without Kerberos authentication Connected to arrow-hdfs:21000 Server version: impalad version 2.9.0-cdh5.12.0 RELEASE (build 03c6ddbdcec39238be4f5b14a300d5c4f576097e) Query: select VERSION() Query submitted at: 2017-10-25 15:43:59 (Coordinator: http://arrow-hdfs:25000) ERROR: AnalysisException: This Impala daemon is not ready to accept user requests. Status: Waiting for catalog update from the StateStore. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++/Python] Add public API for retrieving block locations for a particular > HDFS file > - > > Key: ARROW-473 > URL: https://issues.apache.org/jira/browse/ARROW-473 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney > Labels: pull-request-available > Fix For: 0.8.0 > > > This is necessary for applications looking to schedule data-local work. > libhdfs does not have APIs to request the block locations directly, so we > need to see if the {{hdfsGetHosts}} function will do what we need. For > libhdfs3 there is a public API function -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1727) [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
Wes McKinney created ARROW-1727: --- Summary: [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries Key: ARROW-1727 URL: https://issues.apache.org/jira/browse/ARROW-1727 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1726) [GLib] Add setup description to verify C GLib build
Kouhei Sutou created ARROW-1726: --- Summary: [GLib] Add setup description to verify C GLib build Key: ARROW-1726 URL: https://issues.apache.org/jira/browse/ARROW-1726 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou Priority: Minor Fix For: 0.8.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1726) [GLib] Add setup description to verify C GLib build
[ https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1726: -- Labels: pull-request-available (was: ) > [GLib] Add setup description to verify C GLib build > --- > > Key: ARROW-1726 > URL: https://issues.apache.org/jira/browse/ARROW-1726 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1726) [GLib] Add setup description to verify C GLib build
[ https://issues.apache.org/jira/browse/ARROW-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218708#comment-16218708 ] ASF GitHub Bot commented on ARROW-1726: --- kou opened a new pull request #1247: ARROW-1726: [GLib] Add setup description to verify C GLib build URL: https://github.com/apache/arrow/pull/1247 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [GLib] Add setup description to verify C GLib build > --- > > Key: ARROW-1726 > URL: https://issues.apache.org/jira/browse/ARROW-1726 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1721:
--
Labels: pull-request-available (was: )
> [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
> ---
>
> Key: ARROW-1721
> URL: https://issues.apache.org/jira/browse/ARROW-1721
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
> see https://github.com/apache/spark/pull/18664#discussion_r146472109 for SPARK-21375
[jira] [Commented] (ARROW-1721) [Python] Support null mask in places where it isn't supported in numpy_to_arrow.cc
[ https://issues.apache.org/jira/browse/ARROW-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218681#comment-16218681 ] ASF GitHub Bot commented on ARROW-1721:
---
Licht-T opened a new pull request #1246: ARROW-1721: [Python] Implement null-mask check in places where it isn't supported in numpy_to_arrow.cc
URL: https://github.com/apache/arrow/pull/1246
This closes [ARROW-1721](https://issues.apache.org/jira/projects/ARROW/issues/ARROW-1721).
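[Editor's note] To make the null-mask idea in ARROW-1721 concrete, here is a minimal, purely illustrative Python sketch of what a conversion path does with an explicit mask: each `True` entry marks a null slot, mirroring `numpy.ma` masked-array semantics, and the output carries a separate validity vector. The helper name `apply_null_mask` is hypothetical, not an Arrow API.

```python
# Hypothetical sketch (not Arrow's numpy_to_arrow.cc): translate a boolean
# null mask into a validity vector alongside the raw values.
def apply_null_mask(values, mask):
    """Return (values, validity) where validity[i] == 0 marks a null slot."""
    if len(values) != len(mask):
        raise ValueError("mask length must match values length")
    # mask[i] is True when the slot is null, as with numpy.ma masked arrays.
    validity = [0 if masked else 1 for masked in mask]
    return values, validity

vals, validity = apply_null_mask([1.0, 2.0, 3.0], [False, True, False])
assert validity == [1, 0, 1]
```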
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218645#comment-16218645 ] ASF GitHub Bot commented on ARROW-1723:
---
wesm commented on issue #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#issuecomment-33954
@MaxRis yeah, I agree on that. In case we support other build systems (like Bazel) in the future it would be better to have the exports explicit.
> Windows: __declspec(dllexport) specified when building arrow static library
> ---
>
> Key: ARROW-1723
> URL: https://issues.apache.org/jira/browse/ARROW-1723
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: John Jenkins
> Labels: pull-request-available
>
> As I understand it, dllexport/dllimport should be left out when building and using static libraries on Windows. A PR will follow shortly.
[jira] [Commented] (ARROW-1209) [C++] Implement converter between Arrow record batches and Avro records
[ https://issues.apache.org/jira/browse/ARROW-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218641#comment-16218641 ] ASF GitHub Bot commented on ARROW-1209:
---
mariusvniekerk commented on issue #1026: ARROW-1209: [C++] [WIP] Support for reading avro from an AvroFileReader
URL: https://github.com/apache/arrow/pull/1026#issuecomment-339332258
yeah i'll rebase this and see what needs to change. Think we were missing libjansson last time i touched this.
> [C++] Implement converter between Arrow record batches and Avro records
> ---
>
> Key: ARROW-1209
> URL: https://issues.apache.org/jira/browse/ARROW-1209
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Labels: pull-request-available
> Fix For: 1.0.0
>
> This would be useful for streaming systems that need to consume or produce Avro in C/C++
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218627#comment-16218627 ] ASF GitHub Bot commented on ARROW-1723:
---
MaxRis commented on issue #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#issuecomment-339330749
@wesm some [more reading on what it takes to use WINDOWS_EXPORT_ALL_SYMBOLS](https://blog.kitware.com/create-dlls-on-windows-without-declspec-using-new-cmake-export-all-feature/) (at the bottom of the article). We might try to use it, but it might not be a good idea to rely on a CMake-only feature and lose compatibility with other build tools on Windows.
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218613#comment-16218613 ] ASF GitHub Bot commented on ARROW-1723:
---
JohnPJenkins commented on a change in pull request #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146854730
File path: cpp/cmake_modules/BuildUtils.cmake
@@ -165,6 +165,8 @@ function(ADD_ARROW_LIB LIB_NAME)
     LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
     OUTPUT_NAME ${LIB_NAME_STATIC})
+  target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
Review comment: That makes sense - looking more closely at the cmake file, the unix builds are unconditionally using PIC, so no issues there.
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218610#comment-16218610 ] ASF GitHub Bot commented on ARROW-1723:
---
wesm commented on a change in pull request #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146854212
File path: cpp/cmake_modules/BuildUtils.cmake
@@ -165,6 +165,8 @@ function(ADD_ARROW_LIB LIB_NAME)
     LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
     OUTPUT_NAME ${LIB_NAME_STATIC})
+  target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
Review comment: The objlib thing is an optimization for Unix/macOS so this part could be skipped on Windows
[jira] [Commented] (ARROW-1710) [Java] Decide what to do with non-nullable vectors in new vector class hierarchy
[ https://issues.apache.org/jira/browse/ARROW-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218606#comment-16218606 ] Wes McKinney commented on ARROW-1710:
---
See https://github.com/apache/arrow/blob/master/format/Layout.md#null-bitmaps: "Arrays having a 0 null count may choose to not allocate the null bitmap." So when there are no nulls, it is not necessary to create a BitVector, nor to populate it; as you say, waiting until the first null to create the bitmap might be the way to go.
> [Java] Decide what to do with non-nullable vectors in new vector class hierarchy
> ---
>
> Key: ARROW-1710
> URL: https://issues.apache.org/jira/browse/ARROW-1710
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Java - Vectors
> Reporter: Li Jin
> Fix For: 0.8.0
>
> So far the consensus seems to be remove all non-nullable vectors.
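[Editor's note] The "allocate the bitmap only at the first null" idea discussed above can be sketched in a few lines. This is an illustrative toy in Python, not Arrow's Java vector code; the class name `LazyNullableVector` and its layout are invented for the example.

```python
# Toy model of lazy validity-bitmap allocation: while null_count == 0 no
# bitmap exists at all; the first null materializes it retroactively.
class LazyNullableVector:
    def __init__(self):
        self.values = []
        self.validity = None  # no bitmap while there are no nulls

    def append(self, value):
        if value is None:
            if self.validity is None:
                # First null: create the bitmap, marking prior slots valid.
                self.validity = [1] * len(self.values)
            self.values.append(0)   # placeholder for the null slot
            self.validity.append(0)
        else:
            self.values.append(value)
            if self.validity is not None:
                self.validity.append(1)

    @property
    def null_count(self):
        return 0 if self.validity is None else self.validity.count(0)

v = LazyNullableVector()
v.append(1)
v.append(2)
assert v.validity is None          # nothing allocated yet
v.append(None)
assert v.validity == [1, 1, 0]     # bitmap created on first null
```

This mirrors the spec language quoted above: an array with a zero null count may simply omit the bitmap.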
[jira] [Commented] (ARROW-1723) Windows: __declspec(dllexport) specified when building arrow static library
[ https://issues.apache.org/jira/browse/ARROW-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218601#comment-16218601 ] ASF GitHub Bot commented on ARROW-1723:
---
MaxRis commented on a change in pull request #1244: ARROW-1723: [C++] add ARROW_STATIC to mark static libs on Windows
URL: https://github.com/apache/arrow/pull/1244#discussion_r146852361
File path: cpp/cmake_modules/BuildUtils.cmake
@@ -165,6 +165,8 @@ function(ADD_ARROW_LIB LIB_NAME)
     LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}"
     OUTPUT_NAME ${LIB_NAME_STATIC})
+  target_compile_definitions(${LIB_NAME}_static PUBLIC ARROW_STATIC)
Review comment: @JohnPJenkins it seems the logic should be changed only for Windows (to not increase compilation time on Unix). And on Windows maybe it makes sense to build directly from sources in the cmake script, since it's not possible to reuse object files.
[jira] [Commented] (ARROW-1588) [C++/Format] Harden Decimal Format
[ https://issues.apache.org/jira/browse/ARROW-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218591#comment-16218591 ] ASF GitHub Bot commented on ARROW-1588:
---
wesm closed pull request #1211: ARROW-1588: [C++/Format] Harden Decimal Format
URL: https://github.com/apache/arrow/pull/1211
This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt
index 1178c658c..5df5e748f 100644
--- a/cpp/src/arrow/util/CMakeLists.txt
+++ b/cpp/src/arrow/util/CMakeLists.txt
@@ -42,6 +42,7 @@ install(FILES
   rle-encoding.h
   sse-util.h
   stl.h
+  type_traits.h
   visibility.h
   DESTINATION include/arrow/util)
diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc
index 5a66d7e85..92bdcb5fc 100644
--- a/cpp/src/arrow/util/bit-util-test.cc
+++ b/cpp/src/arrow/util/bit-util-test.cc
@@ -28,7 +28,6 @@
 #include "arrow/buffer.h"
 #include "arrow/memory_pool.h"
-#include "arrow/status.h"
 #include "arrow/test-util.h"
 #include "arrow/util/bit-stream-utils.h"
 #include "arrow/util/bit-util.h"
@@ -334,4 +333,36 @@ TEST(BitStreamUtil, ZigZag) {
   TestZigZag(-std::numeric_limits::max());
 }
+
+TEST(BitUtil, RoundTripLittleEndianTest) {
+  uint64_t value = 0xFF;
+
+#if ARROW_LITTLE_ENDIAN
+  uint64_t expected = value;
+#else
+  uint64_t expected = std::numeric_limits<uint64_t>::max() << 56;
+#endif
+
+  uint64_t little_endian_result = BitUtil::ToLittleEndian(value);
+  ASSERT_EQ(expected, little_endian_result);
+
+  uint64_t from_little_endian = BitUtil::FromLittleEndian(little_endian_result);
+  ASSERT_EQ(value, from_little_endian);
+}
+
+TEST(BitUtil, RoundTripBigEndianTest) {
+  uint64_t value = 0xFF;
+
+#if ARROW_LITTLE_ENDIAN
+  uint64_t expected = std::numeric_limits<uint64_t>::max() << 56;
+#else
+  uint64_t expected = value;
+#endif
+
+  uint64_t big_endian_result = BitUtil::ToBigEndian(value);
+  ASSERT_EQ(expected, big_endian_result);
+
+  uint64_t from_big_endian = BitUtil::FromBigEndian(big_endian_result);
+  ASSERT_EQ(value, from_big_endian);
+}
+
 }  // namespace arrow
diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h
index 2509de21f..8043f90cc 100644
--- a/cpp/src/arrow/util/bit-util.h
+++ b/cpp/src/arrow/util/bit-util.h
@@ -56,6 +56,7 @@
 #include
 #include "arrow/util/macros.h"
+#include "arrow/util/type_traits.h"
 #include "arrow/util/visibility.h"
 #ifdef ARROW_USE_SSE
@@ -305,7 +306,7 @@ static inline uint32_t ByteSwap(uint32_t value) {
   return static_cast<uint32_t>(ARROW_BYTE_SWAP32(value));
 }
 static inline int16_t ByteSwap(int16_t value) {
-  constexpr int16_t m = static_cast<int16_t>(0xff);
+  constexpr auto m = static_cast<int16_t>(0xff);
   return static_cast<int16_t>(((value >> 8) & m) | ((value & m) << 8));
 }
 static inline uint16_t ByteSwap(uint16_t value) {
@@ -331,8 +332,8 @@ static inline void ByteSwap(void* dst, const void* src, int len) {
       break;
   }
-  uint8_t* d = reinterpret_cast<uint8_t*>(dst);
-  const uint8_t* s = reinterpret_cast<const uint8_t*>(src);
+  auto d = reinterpret_cast<uint8_t*>(dst);
+  auto s = reinterpret_cast<const uint8_t*>(src);
   for (int i = 0; i < len; ++i) {
     d[i] = s[len - i - 1];
   }
@@ -341,36 +342,57 @@
 /// Converts to big endian format (if not already in big endian) from the
 /// machine's native endian format.
 #if ARROW_LITTLE_ENDIAN
-static inline int64_t ToBigEndian(int64_t value) { return ByteSwap(value); }
-static inline uint64_t ToBigEndian(uint64_t value) { return ByteSwap(value); }
-static inline int32_t ToBigEndian(int32_t value) { return ByteSwap(value); }
-static inline uint32_t ToBigEndian(uint32_t value) { return ByteSwap(value); }
-static inline int16_t ToBigEndian(int16_t value) { return ByteSwap(value); }
-static inline uint16_t ToBigEndian(uint16_t value) { return ByteSwap(value); }
+template >
+static inline T ToBigEndian(T value) {
+  return ByteSwap(value);
+}
+
+template >
+static inline T ToLittleEndian(T value) {
+  return value;
+}
#else
-static inline int64_t ToBigEndian(int64_t val) { return val; }
-static inline uint64_t ToBigEndian(uint64_t val) { return val; }
-static inline int32_t ToBigEndian(int32_t val) { return val; }
-static inline uint32_t ToBigEndian(uint32_t val) { return val; }
-static inline int16_t ToBigEndian(int16_t val) { return val; }
-static inline uint16_t ToBigEndian(uint16_t val) { return val; }
+template >
+static inline T ToBigEndian(T value) {
+  return value;
+}
#endif
 /// Converts from big endian format to the machine's native endian format.
 #if ARROW_LITTLE_ENDIAN
-static inline int64_t FromBigEndian(int64_t value) { return ByteSwap(value); }
-static inline uint64_t Fro
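[Editor's note] The round-trip property the new `BitUtil` tests in the diff assert can be sketched independently. This is an illustrative Python model, not Arrow code: on a little-endian host `ToBigEndian` byte-swaps (so `0xFF` becomes `0xFF00000000000000`), and applying the swap again recovers the original value, since a byte swap is its own inverse.

```python
import sys

def byteswap64(value):
    """Reverse the byte order of a 64-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(8, "little"), "big")

def to_big_endian(value):
    # Swap only when the host's native order is little-endian, mirroring
    # the #if ARROW_LITTLE_ENDIAN branches in the diff above.
    return byteswap64(value) if sys.byteorder == "little" else value

# The swap is involutive, so the inverse conversion is the same function.
from_big_endian = to_big_endian

value = 0xFF
assert byteswap64(value) == 0xFF00000000000000   # matches the test's expected
assert from_big_endian(to_big_endian(value)) == value  # round trip
```

The `0xFF00000000000000` constant is exactly the `std::numeric_limits<uint64_t>::max() << 56` expected value used in the C++ tests.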