[jira] [Updated] (ARROW-8552) [Rust] support column iteration for parquet row
[ https://issues.apache.org/jira/browse/ARROW-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8552: -- Labels: pull-request-available (was: ) > [Rust] support column iteration for parquet row > --- > > Key: ARROW-8552 > URL: https://issues.apache.org/jira/browse/ARROW-8552 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: QP Hou >Priority: Minor > Labels: pull-request-available > > It would be useful to be able to iterate through all the columns in a parquet > row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8552) [Rust] support column iteration for parquet row
QP Hou created ARROW-8552: - Summary: [Rust] support column iteration for parquet row Key: ARROW-8552 URL: https://issues.apache.org/jira/browse/ARROW-8552 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: QP Hou It would be useful to be able to iterate through all the columns in a parquet row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
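The iteration the issue asks for can be sketched with a toy model in Python (the real feature targets the Rust parquet crate's row type; the class and method names below are illustrative stand-ins, not the crate's actual API):

```python
class Row:
    """Toy stand-in for a parquet row: an ordered mapping of column names to values."""

    def __init__(self, columns):
        self._columns = columns  # list of (name, value) pairs, in schema order

    def get_column_iter(self):
        # The requested feature: iterate over all (name, value) pairs of the row.
        return iter(self._columns)


row = Row([("id", 1), ("name", "alice")])
for name, value in row.get_column_iter():
    print(name, value)
```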
[jira] [Created] (ARROW-8551) [CI][Gandiva] Use docker image with LLVM 8 to build gandiva linux jar
Prudhvi Porandla created ARROW-8551: --- Summary: [CI][Gandiva] Use docker image with LLVM 8 to build gandiva linux jar Key: ARROW-8551 URL: https://issues.apache.org/jira/browse/ARROW-8551 Project: Apache Arrow Issue Type: Task Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8551) [CI][Gandiva] Use LLVM 8 to build gandiva linux jar
[ https://issues.apache.org/jira/browse/ARROW-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prudhvi Porandla updated ARROW-8551: Summary: [CI][Gandiva] Use LLVM 8 to build gandiva linux jar (was: [CI][Gandiva] Use docker image with LLVM 8 to build gandiva linux jar) > [CI][Gandiva] Use LLVM 8 to build gandiva linux jar > --- > > Key: ARROW-8551 > URL: https://issues.apache.org/jira/browse/ARROW-8551 > Project: Apache Arrow > Issue Type: Task >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8551) [CI][Gandiva] Use LLVM 8 to build gandiva linux jar
[ https://issues.apache.org/jira/browse/ARROW-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8551: -- Labels: pull-request-available (was: ) > [CI][Gandiva] Use LLVM 8 to build gandiva linux jar > --- > > Key: ARROW-8551 > URL: https://issues.apache.org/jira/browse/ARROW-8551 > Project: Apache Arrow > Issue Type: Task >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8528) [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing
[ https://issues.apache.org/jira/browse/ARROW-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prudhvi Porandla closed ARROW-8528. --- Resolution: Cannot Reproduce > [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing > -- > > Key: ARROW-8528 > URL: https://issues.apache.org/jira/browse/ARROW-8528 > Project: Apache Arrow > Issue Type: Bug >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8528) [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing
[ https://issues.apache.org/jira/browse/ARROW-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089246#comment-17089246 ] Prudhvi Porandla commented on ARROW-8528: - resolved with [https://github.com/Homebrew/homebrew-core/pull/53445/files] > [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing > -- > > Key: ARROW-8528 > URL: https://issues.apache.org/jira/browse/ARROW-8528 > Project: Apache Arrow > Issue Type: Bug >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8531) [C++] Deprecate ARROW_USE_SIMD CMake option
[ https://issues.apache.org/jira/browse/ARROW-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089225#comment-17089225 ] Yibo Cai commented on ARROW-8531: - ARROW_USE_SIMD removed in https://github.com/apache/arrow/pull/6954 > [C++] Deprecate ARROW_USE_SIMD CMake option > --- > > Key: ARROW-8531 > URL: https://issues.apache.org/jira/browse/ARROW-8531 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This is superseded by the {{ARROW_SIMD_LEVEL}} option -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8550) [CI] Don't run cron GHA jobs on forks
[ https://issues.apache.org/jira/browse/ARROW-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8550. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7005 [https://github.com/apache/arrow/pull/7005] > [CI] Don't run cron GHA jobs on forks > - > > Key: ARROW-8550 > URL: https://issues.apache.org/jira/browse/ARROW-8550 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > It's wasteful, and I'm tired of seeing them clogging up my Actions tab and > notifications. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
[ https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089198#comment-17089198 ] Mark Hildreth commented on ARROW-8508: --
I believe there are a few things going on here:
1.) I wouldn't consider myself an expert on these APIs, but it seems like the builders are being used correctly.
2.) The debug output definitely appears broken. I opened a [PR to fix this|https://github.com/apache/arrow/pull/7006], which puts it more in line with how the non-fixed-size *ListArray* works. This should fix the *value()* method on the FixedSizeListArray to properly take the offset into the child array into account.
3.) As for the asserts that fail, I'm less certain. The values in these asserts are taken from the *values()* method, which seems to just return the underlying array without taking offsets into account. This seems to be similar to how other arrays work (including primitives), so my guess is that it is by design. I don't have an explanation for a better way of using the API, so maybe someone else can provide input.
> [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
>
> Key: ARROW-8508
> URL: https://issues.apache.org/jira/browse/ARROW-8508
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Affects Versions: 0.16.0
> Reporter: Christian Beilschmidt
> Priority: Major
> Labels: pull-request-available
>
> I created an example of storing multi points with Arrow.
> # A coordinate consists of two floats (Float64Builder)
> # A multi point consists of one or more coordinates (FixedSizeListBuilder)
> # A list of multi points consists of multiple multi points (ListBuilder)
> This is the corresponding code snippet:
> {code:java}
> let float_builder = arrow::array::Float64Builder::new(0);
> let coordinate_builder = arrow::array::FixedSizeListBuilder::new(float_builder, 2);
> let mut multi_point_builder = arrow::array::ListBuilder::new(coordinate_builder);
> multi_point_builder.values().values().append_slice(&[0.0, 0.1]).unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.values().values().append_slice(&[1.0, 1.1]).unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.append(true).unwrap(); // first multi point
> multi_point_builder.values().values().append_slice(&[2.0, 2.1]).unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.values().values().append_slice(&[3.0, 3.1]).unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.values().values().append_slice(&[4.0, 4.1]).unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.append(true).unwrap(); // second multi point
> let multi_point = dbg!(multi_point_builder.finish());
> let first_multi_point_ref = multi_point.value(0);
> let first_multi_point: &arrow::array::FixedSizeListArray =
>     first_multi_point_ref.as_any().downcast_ref().unwrap();
> let coordinates_ref = first_multi_point.values();
> let coordinates: &arrow::array::Float64Array =
>     coordinates_ref.as_any().downcast_ref().unwrap();
> assert_eq!(coordinates.value_slice(0, 2 * 2), &[0.0, 0.1, 1.0, 1.1]);
> let second_multi_point_ref = multi_point.value(1);
> let second_multi_point: &arrow::array::FixedSizeListArray =
>     second_multi_point_ref.as_any().downcast_ref().unwrap();
> let coordinates_ref = second_multi_point.values();
> let coordinates: &arrow::array::Float64Array =
>     coordinates_ref.as_any().downcast_ref().unwrap();
> assert_eq!(coordinates.value_slice(0, 2 * 3), &[2.0, 2.1, 3.0, 3.1, 4.0, 4.1]);
> {code}
> The second assertion fails and the output is {{[0.0, 0.1, 1.0, 1.1, 2.0, 2.1]}}.
> Moreover, the debug output produced from {{dbg!}} confirms this:
> {noformat}
> [
>   FixedSizeListArray<2>
>   [
>     PrimitiveArray
>     [
>       0.0,
>       0.1,
>     ],
>     PrimitiveArray
>     [
>       1.0,
>       1.1,
>     ],
>   ],
>   FixedSizeListArray<2>
>   [
>     PrimitiveArray
>     [
>       0.0,
>       0.1,
>     ],
>     PrimitiveArray
>     [
>       1.0,
>       1.1,
>     ],
>     PrimitiveArray
>     [
>       2.0,
>       2.1,
>     ],
>   ],
> ]
> {noformat}
> The second list should contain the values 2-4.
> So either I am using the builder wrong or there is a bug with the offsets. I used {{0.16}} as well as the current {{master}} from GitHub.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
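The offset behavior discussed in this issue, that a fixed-size list's {{value(i)}} must take the offset into the child array into account, can be illustrated with a tiny pure-Python model (not Arrow's API; the names here are hypothetical):

```python
class FixedSizeList:
    """Toy model of a FixedSizeListArray: fixed-length lists over one flat child buffer."""

    def __init__(self, child, list_size):
        self.child = child          # flat child values, e.g. interleaved x/y coordinates
        self.list_size = list_size  # number of child values per list entry

    def value(self, i):
        # Correct behavior: entry i starts at offset i * list_size in the child.
        start = i * self.list_size
        return self.child[start:start + self.list_size]


coords = FixedSizeList([0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4.0, 4.1], 2)
print(coords.value(2))  # third coordinate pair -> [2.0, 2.1]
```

A value() that ignored the offset would always return the first list_size child values, which matches the broken debug output quoted in the issue.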
[jira] [Updated] (ARROW-8537) [C++] Performance regression from ARROW-8523
[ https://issues.apache.org/jira/browse/ARROW-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8537: -- Labels: pull-request-available (was: ) > [C++] Performance regression from ARROW-8523 > > > Key: ARROW-8537 > URL: https://issues.apache.org/jira/browse/ARROW-8537 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yibo Cai >Priority: Major > Labels: pull-request-available > > I optimized BitmapReader in [this > PR|https://github.com/apache/arrow/pull/6986] and saw a performance uplift in the > BitmapReader test case. I didn't check other test cases as this change looked > trivial. > I reviewed all test cases just now and saw a big performance drop in 4 cases; > details at the [PR > link|https://github.com/apache/arrow/pull/6986#issuecomment-616915079]. > I also compared the performance of code using BitmapReader; no obvious changes > found. Looks like we should revert that PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
[ https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8508: -- Labels: pull-request-available (was: )
> [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
>
> Key: ARROW-8508
> URL: https://issues.apache.org/jira/browse/ARROW-8508
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Affects Versions: 0.16.0
> Reporter: Christian Beilschmidt
> Priority: Major
> Labels: pull-request-available
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8550) [CI] Don't run cron GHA jobs on forks
[ https://issues.apache.org/jira/browse/ARROW-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089125#comment-17089125 ] Antoine Pitrou commented on ARROW-8550: --- I don't really agree, I use them from time to time. > [CI] Don't run cron GHA jobs on forks > - > > Key: ARROW-8550 > URL: https://issues.apache.org/jira/browse/ARROW-8550 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > > It's wasteful, and I'm tired of seeing them clogging up my Actions tab and > notifications. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8512) [C++] Delete unused compute expr prototype code
[ https://issues.apache.org/jira/browse/ARROW-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8512: --- Assignee: Wes McKinney > [C++] Delete unused compute expr prototype code > --- > > Key: ARROW-8512 > URL: https://issues.apache.org/jira/browse/ARROW-8512 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Most of the code added in > https://github.com/apache/arrow/commit/08ca13f83f3d6dbd818c4280d619dae306aa9de5 > can be deleted. I may leave some of the "shape" types in case we can make > use of those. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8550) [CI] Don't run cron GHA jobs on forks
Neal Richardson created ARROW-8550: -- Summary: [CI] Don't run cron GHA jobs on forks Key: ARROW-8550 URL: https://issues.apache.org/jira/browse/ARROW-8550 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Neal Richardson Assignee: Neal Richardson It's wasteful, and I'm tired of seeing them clogging up my Actions tab and notifications. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8550) [CI] Don't run cron GHA jobs on forks
[ https://issues.apache.org/jira/browse/ARROW-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8550: -- Labels: pull-request-available (was: ) > [CI] Don't run cron GHA jobs on forks > - > > Key: ARROW-8550 > URL: https://issues.apache.org/jira/browse/ARROW-8550 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > > It's wasteful, and I'm tired of seeing them clogging up my Actions tab and > notifications. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8549) [R] Assorted post-0.17 release cleanups
[ https://issues.apache.org/jira/browse/ARROW-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8549: -- Labels: pull-request-available (was: ) > [R] Assorted post-0.17 release cleanups > --- > > Key: ARROW-8549 > URL: https://issues.apache.org/jira/browse/ARROW-8549 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8549) [R] Assorted post-0.17 release cleanups
Neal Richardson created ARROW-8549: -- Summary: [R] Assorted post-0.17 release cleanups Key: ARROW-8549 URL: https://issues.apache.org/jira/browse/ARROW-8549 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7011) [C++] Implement casts from float/double to decimal128
[ https://issues.apache.org/jira/browse/ARROW-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089080#comment-17089080 ] Jacek Pliszka commented on ARROW-7011: -- Would it be enough to: multiply by 10**scale, cast to integer, cast to decimal128, correct the scale? > [C++] Implement casts from float/double to decimal128 > - > > Key: ARROW-7011 > URL: https://issues.apache.org/jira/browse/ARROW-7011 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > see also ARROW-5905, ARROW-7010 -- This message was sent by Atlassian Jira (v8.3.4#803005)
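The steps Jacek proposes can be sketched in plain Python with the stdlib decimal module (illustrative only; the actual cast would live in Arrow's C++ kernels, and the function name here is made up):

```python
from decimal import Decimal


def float_to_decimal(x: float, scale: int) -> Decimal:
    # 1. multiply by 10**scale, 2. round to an integer,
    # 3. shift the decimal point back to correct the scale.
    unscaled = round(x * 10 ** scale)
    return Decimal(unscaled).scaleb(-scale)


print(float_to_decimal(52.379, 3))  # -> 52.379
```

Note the rounding step: binary floats cannot represent most decimal fractions exactly, so the result is the nearest decimal with the requested scale, not necessarily the "true" value.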
[jira] [Created] (ARROW-8548) [Website] 0.17 release post
Neal Richardson created ARROW-8548: -- Summary: [Website] 0.17 release post Key: ARROW-8548 URL: https://issues.apache.org/jira/browse/ARROW-8548 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Neal Richardson Assignee: Neal Richardson -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089069#comment-17089069 ] Jacek Pliszka commented on ARROW-8545: -- Cast from float should allow quick conversion
> [Python] Allow fast writing of Decimal column to parquet
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.0
> Reporter: Fons de Leeuw
> Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only possibility is to use the `decimal.Decimal` standard-library type. This is then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite impressive given that [fastparquet does not write decimals|#data-types] at all. However, the writing is *very* slow; in the code snippet below, by a factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column could be made faster, but I am not familiar enough with pandas internals to know if that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being converted, which is very slow. That would at least make this behavior more explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it would be nice if a workaround were possible. For example, pass an int and then shift the dot "x" places to the left. (It is already possible to pass an int column and specify a "decimal" dtype in the Arrow schema during `pa.Table.from_pandas()`, but then it simply becomes a decimal without decimals.) Also, it might be nice if it could be encoded as a 128-bit byte string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with latitude/longitude. I can't use float as comparisons need to be exact, and the BigQuery "clustering" feature needs either an integer or a decimal but not a float. In the meantime, I have to use a workaround where I use only ints (the original number multiplied by 1000).
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
>
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
>
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
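The integer workaround mentioned in the description (store the number multiplied by 1000, then shift the dot on read) can be sketched with the stdlib decimal module; the helper names are made up for illustration:

```python
from decimal import Decimal

SCALE = 3  # digits after the decimal point


def encode(x: float) -> int:
    # Store e.g. a latitude as an exact scaled integer: 52.379 -> 52379.
    return round(x * 10 ** SCALE)


def decode(raw: int) -> Decimal:
    # Shift the dot SCALE places to the left to recover the decimal value.
    return Decimal(raw).scaleb(-SCALE)


print(decode(encode(52.379)))  # -> 52.379
```

Comparisons on the encoded integer column stay exact, which is what the BigQuery clustering use case in the description needs.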
[jira] [Resolved] (ARROW-8542) [Release] Fix checksum url in the website post release script
[ https://issues.apache.org/jira/browse/ARROW-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8542. - Resolution: Fixed Issue resolved by pull request 6999 [https://github.com/apache/arrow/pull/6999] > [Release] Fix checksum url in the website post release script > - > > Key: ARROW-8542 > URL: https://issues.apache.org/jira/browse/ARROW-8542 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The issue was captured here > https://github.com/apache/arrow-site/pull/53#discussion_r411728907 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8541) [Release] Don't remove previous source releases automatically
[ https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089058#comment-17089058 ] Kouhei Sutou commented on ARROW-8541: - Wow! I didn't know the archive site. > [Release] Don't remove previous source releases automatically > - > > Key: ARROW-8541 > URL: https://issues.apache.org/jira/browse/ARROW-8541 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We should keep at least the last three source tarballs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8539: Fix Version/s: 1.0.0 > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Assignee: Yosuke Shiro >Priority: Critical > Fix For: 1.0.0 > > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8545: --- Summary: [Python] Allow fast writing of Decimal column to parquet (was: Allow fast writing of Decimal column to parquet)
> [Python] Allow fast writing of Decimal column to parquet
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.0
> Reporter: Fons de Leeuw
> Priority: Minor
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8547) [Rust] Implement JsonEqual for UnionArray
Paddy Horan created ARROW-8547: -- Summary: [Rust] Implement JsonEqual for UnionArray Key: ARROW-8547 URL: https://issues.apache.org/jira/browse/ARROW-8547 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan Assignee: Paddy Horan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8546) [Rust] Handle UnionArray in get_fb_field_type
Paddy Horan created ARROW-8546: -- Summary: [Rust] Handle UnionArray in get_fb_field_type Key: ARROW-8546 URL: https://issues.apache.org/jira/browse/ARROW-8546 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Paddy Horan Assignee: Paddy Horan -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8545) Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089011#comment-17089011 ] Jacek Pliszka edited comment on ARROW-8545 at 4/21/20, 7:54 PM: OK, I checked and This is my version: {code:java} pat = pa.Table.from_pandas(df) t3 = time() print(t3-t2) pat = pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3))) t4 = time() print(t4 - t3) pq.write_table(pat, '/tmp/testabd.pq') t5 = time() print(t5 - t4) {code} And we are getting here A) 0.3s for conversion from pandas to arrow Table B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast implemented from double to decimal(38, 3) C) 2.8s for writing table to parquet file - is it fast enough for you B and C are separate topics and should have separate issues. In B decimal128 should be easier if this is enough for you was (Author: jacek.pliszka): OK, I checked and This is my version: {code:java} pat = pa.Table.from_pandas(df) t3 = time() print(t3-t2) pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3))) t4 = time() print(t4 - t3) pq.write_table(pat, '/tmp/testabd.pq') t5 = time() print(t5 - t4) {code} And we are getting here A) 0.3s for conversion from pandas to arrow Table B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast implemented from double to decimal(38, 3) C) 2.8s for writing table to parquet file - is it fast enough for you B and C are separate topics and should have separate issues. In B decimal128 should be easier if this is enough for you > Allow fast writing of Decimal column to parquet > --- > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in Pandas, the only > possibility is to use the `decimal.Decimal` standard-libary type. 
This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at > all. However, the writing is *very* slow - a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column can > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown when object-typed columns are being > converted, since that is very slow. That would at least make this behavior more > explicit. > Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround were possible. For example, pass an int and then > shift the dot "x" places to the left. (It is already possible to pass an int > column and specify "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()` but then it simply becomes a decimal without > decimals.) Also, it might be nice if it could be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to use a workaround with only ints > (the original number multiplied by 1000). 
> *Snippet* > {code:java} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": > d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
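The cast failure in B above points back at the integer-scaling workaround the issue description already mentions (store the number multiplied by 10^scale as an int, then shift the dot back on read). A minimal, pyarrow-free sketch of that trick; the helper names `encode_scaled`/`decode_scaled` are illustrative, not from the issue:

```python
import decimal

SCALE = 3  # number of decimal places to preserve

def encode_scaled(values, scale=SCALE):
    # Round each float to `scale` places and store it as a plain int.
    return [round(v * 10 ** scale) for v in values]

def decode_scaled(ints, scale=SCALE):
    # Shift the dot back: exact decimal.Decimal values, no float error.
    q = decimal.Decimal(1).scaleb(-scale)
    return [decimal.Decimal(i).scaleb(-scale).quantize(q) for i in ints]

print(encode_scaled([12.3456, 0.1]))  # [12346, 100]
print(decode_scaled([12346, 100]))    # [Decimal('12.346'), Decimal('0.100')]
```

Int columns write to parquet at full speed, and the consumer can reconstruct exact decimal values; the cost is that the scale must travel out of band.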
[jira] [Comment Edited] (ARROW-8545) Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089011#comment-17089011 ] Jacek Pliszka edited comment on ARROW-8545 at 4/21/20, 7:53 PM: OK, I checked. This is my version: {code:python} pat = pa.Table.from_pandas(df) t3 = time() print(t3 - t2) pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3))) t4 = time() print(t4 - t3) pq.write_table(pat, '/tmp/testabd.pq') t5 = time() print(t5 - t4) {code} And we get: A) 0.3s for the conversion from pandas to an Arrow Table B) the cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast implemented from double to decimal(38, 3) C) 2.8s for writing the table to a parquet file - is that fast enough for you? B and C are separate topics and should have separate issues. For B, decimal128 should be easier, if that is enough for you. was (Author: jacek.pliszka): OK, I checked and This is my version: {code:java} pat = pa.Table.from_pandas(df) t3 = time() print(t3-t2) pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3))) t4 = time() print(t4 - t3) pq.write_table(pat, '/tmp/testabd.pq') t5 = time() print(t5 - t4) {code} And we are getting here A) 0.3s for conversion from pandas to arrow Table B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast implemented from double to decimal(38, 3) C) 2.8s for writing table to parquet file - is it fast enough for you B and C are separate topics and should have separate issues. In B decimal128 should be easier if this is enough for you > Allow fast writing of Decimal column to parquet > --- > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in Pandas, the only > possibility is to use the `decimal.Decimal` standard-library type. 
This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at > all. However, the writing is *very* slow - a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column can > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown when object-typed columns are being > converted, since that is very slow. That would at least make this behavior more > explicit. > Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround were possible. For example, pass an int and then > shift the dot "x" places to the left. (It is already possible to pass an int > column and specify "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()` but then it simply becomes a decimal without > decimals.) Also, it might be nice if it could be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to use a workaround with only ints > (the original number multiplied by 1000). 
> *Snippet* > {code:java} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": > d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8545) Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089011#comment-17089011 ] Jacek Pliszka commented on ARROW-8545: -- OK, I checked. This is my version: {code:python} pat = pa.Table.from_pandas(df) t3 = time() print(t3 - t2) pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3))) t4 = time() print(t4 - t3) pq.write_table(pat, '/tmp/testabd.pq') t5 = time() print(t5 - t4) {code} And we get: A) 0.3s for the conversion from pandas to an Arrow Table B) the cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast implemented from double to decimal(38, 3) C) 2.8s for writing the table to a parquet file - is that fast enough for you? B and C are separate topics and should have separate issues. For B, decimal128 should be easier, if that is enough for you. > Allow fast writing of Decimal column to parquet > --- > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in Pandas, the only > possibility is to use the `decimal.Decimal` standard-library type. This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at > all. However, the writing is *very* slow - a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column can > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown when object-typed columns are being > converted, since that is very slow. That would at least make this behavior more > explicit. 
> Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround is possible. For example, pass an int and then > shift the dot "x" places to the left. (It is already possible to pass an int > column and specify "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()` but then it simply becomes a decimal without > decimals.) Also, it might be nice if it can be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to do a workaround where I use only ints > (the original number multiplied by 1000.) > *Snippet* > {code:java} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": > d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8545) Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088988#comment-17088988 ] Jacek Pliszka edited comment on ARROW-8545 at 4/21/20, 7:20 PM: Once you have decimal - is writing fast enough? Because you are actually talking about 2 different things: # cast to arrow decimal - not sure if it is implemented, but it is relatively easy # fast writing of decimal to parquet - is it fast enough for you? was (Author: jacek.pliszka): My suggestion would be to split into 2 pieces: # cast to decimal - not sure if it is implemented but it is relatively easy # writing decimal to parquet - not sure what is the status of it as well > Allow fast writing of Decimal column to parquet > --- > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in Pandas, the only > possibility is to use the `decimal.Decimal` standard-library type. This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at > all. However, the writing is *very* slow - a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column can > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown when object-typed columns are being > converted, since that is very slow. That would at least make this behavior more > explicit. > Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround were possible. 
For example, pass an int and then > shift the dot "x" places to the left. (It is already possible to pass an int > column and specify "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()` but then it simply becomes a decimal without > decimals.) Also, it might be nice if it can be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to do a workaround where I use only ints > (the original number multiplied by 1000.) > *Snippet* > {code:java} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": > d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
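On the "cast to decimal" half of the split above: with Arrow's double-to-decimal cast unimplemented, the snippet's `round().astype(str).map(decimal.Decimal)` route can also be written without the string round-trip, by quantizing a `decimal.Decimal` built directly from the float. This is only a sketch of the two routes (nothing here is measured in pandas, and the helper names are mine):

```python
import decimal

THREE_PLACES = decimal.Decimal("0.001")

def to_decimal_via_str(x):
    # The route used in the issue's snippet: float -> round -> str -> Decimal.
    return decimal.Decimal(str(round(x, 3)))

def to_decimal_direct(x):
    # Alternative: Decimal from the float's exact binary value, then
    # quantize to three places (ROUND_HALF_EVEN by default).
    return decimal.Decimal(x).quantize(THREE_PLACES)

print(to_decimal_direct(0.1))                             # 0.100
print(to_decimal_direct(0.1) == to_decimal_via_str(0.1))  # True
```

The two routes can disagree on ties because `round()` rounds on the decimal string while `quantize()` rounds the exact binary value, so which one is "right" depends on what the column is supposed to mean.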
[jira] [Assigned] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-8543: - Assignee: Mayur Srivastava > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Assignee: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The current coalescing algorithm is a two-pass algorithm (where N is the number > of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds the coalesced range to the result > (out). > The proposal is to convert the algorithm to a single-pass algorithm that > computes coalesced ranges while making the first pass over the ranges in the > list. This algorithm is also shorter in lines of code and hence (hopefully) > more maintainable in the long term. > Correction: Post sorting, the current algorithm is O(N) and the improvement > is O(N); I called the current algorithm O(N^2) due to an oversight. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-8543. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7002 [https://github.com/apache/arrow/pull/7002] > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The current coalescing algorithm is a two-pass algorithm (where N is the number > of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds the coalesced range to the result > (out). > The proposal is to convert the algorithm to a single-pass algorithm that > computes coalesced ranges while making the first pass over the ranges in the > list. This algorithm is also shorter in lines of code and hence (hopefully) > more maintainable in the long term. > Correction: Post sorting, the current algorithm is O(N) and the improvement > is O(N); I called the current algorithm O(N^2) due to an oversight. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8545) Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088988#comment-17088988 ] Jacek Pliszka commented on ARROW-8545: -- My suggestion would be to split this into 2 pieces: # cast to decimal - not sure if it is implemented, but it is relatively easy # writing decimal to parquet - not sure what the status of that is either > Allow fast writing of Decimal column to parquet > --- > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in Pandas, the only > possibility is to use the `decimal.Decimal` standard-library type. This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at > all. However, the writing is *very* slow - a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column can > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown when object-typed columns are being > converted, since that is very slow. That would at least make this behavior more > explicit. > Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround were possible. For example, pass an int and then > shift the dot "x" places to the left. (It is already possible to pass an int > column and specify "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()` but then it simply becomes a decimal without > decimals.) 
Also, it might be nice if it can be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to do a workaround where I use only ints > (the original number multiplied by 1000.) > *Snippet* > {code:java} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": > d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8545) Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fons de Leeuw updated ARROW-8545: - Description: Currently, when one wants to use a decimal datatype in Pandas, the only possibility is to use the `decimal.Decimal` standard-library type. This is then an "object" column in the DataFrame. Arrow can write a column of decimal type to Parquet, which is quite impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at all. However, the writing is *very* slow - a factor of 4 in the code snippet below. *Improvements* Of course the best outcome would be if the conversion of a decimal column can be made faster, but I am not familiar enough with pandas internals to know if that's possible. (This same behavior also applies to `.to_pickle` etc.) It would be nice if a warning were shown when object-typed columns are being converted, since that is very slow. That would at least make this behavior more explicit. Now, if fast parsing of a decimal.Decimal object column is not possible, it would be nice if a workaround were possible. For example, pass an int and then shift the dot "x" places to the left. (It is already possible to pass an int column and specify "decimal" dtype in the Arrow schema during `pa.Table.from_pandas()` but then it simply becomes a decimal without decimals.) Also, it might be nice if it could be encoded as a 128-bit byte string in the pandas column and then directly interpreted by Arrow. *Usecase* I need to save large dataframes (~10GB) of geospatial data with latitude/longitude. I can't use float as comparisons need to be exact, and the BigQuery "clustering" feature needs either an integer or a decimal but not a float. In the meantime, I have to use a workaround with only ints (the original number multiplied by 1000). 
*Snippet* {code:java} import decimal from time import time import numpy as np import pandas as pd d = dict() for col in "abcdefghijklmnopqrstuvwxyz": d[col] = np.random.rand(int(1E7)) * 100 df = pd.DataFrame(d) t0 = time() df.to_parquet("/tmp/testabc.pq", engine="pyarrow") t1 = time() df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) t2 = time() df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") t3 = time() print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s") # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code} was: Currently, when one wants to use a decimal datatype in Pandas, the only possibility is to use the `decimal.Decimal` standard-libary type. This is then an "object" column in the DataFrame. Arrow can write a column of decimal type to Parquet, which is quite impressive given that [fastparquet does not write decimals|[https://fastparquet.readthedocs.io/en/latest/details.html#data-types]] at all. However, the writing is *very* slow, in the code snippet below a factor of 4. *Improvements*** Of course the best outcome would be if the conversion of a decimal column can be made faster, but I am not familiar enough with pandas internals to know if that's possible. (This same behavior also applies to `.to_pickle` etc.) It would be nice, if a warning is shown that object-typed columns are being converted which is very slow. That would at least make this behavior more explicit. Now, if fast parsing of a decimal.Decimal object column is not possible, it would be nice if a workaround is possible. For example, pass an int and then shift the dot "x" places to the left. (It is already possible to pass an int column and specify "decimal" dtype in the Arrow schema during `pa.Table.from_pandas()` but then it simply becomes a decimal without decimals.) Also, it might be nice if it can be encoded as a 128-bit byte string in the pandas column and then directly interpreted by Arrow. 
*Usecase* I need to save large dataframes (~10GB) of geospatial data with latitude/longitude. I can't use float as comparisons need to be exact, and the BigQuery "clustering" feature needs either an integer or a decimal but not a float. In the meantime, I have to do a workaround where I use only ints (the original number multiplied by 1000.) *Snippet* {code:java} {code} *import decimal from time import time import numpy as np import pandas as pd d = dict() for col in "abcdefghijklmnopqrstuvwxyz": d[col] = np.random.rand(int(1E7)) * 100 df = pd.DataFrame(d) t0 = time() df.to_parquet("/tmp/testabc.pq", engine="pyarrow") t1 = time() df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) t2 = time() df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") t3 = time() print(f"Saving the normal dataframe took \{t1-t0:.3f}s, with one decimal column \{t3-t2:.3f}s")* *# Saving the normal dataframe took 4.430s, with one decimal column 17.673s*** > Allow fast writing of Decimal column to parquet >
[jira] [Created] (ARROW-8545) Allow fast writing of Decimal column to parquet
Fons de Leeuw created ARROW-8545: Summary: Allow fast writing of Decimal column to parquet Key: ARROW-8545 URL: https://issues.apache.org/jira/browse/ARROW-8545 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.17.0 Reporter: Fons de Leeuw Currently, when one wants to use a decimal datatype in Pandas, the only possibility is to use the `decimal.Decimal` standard-library type. This is then an "object" column in the DataFrame. Arrow can write a column of decimal type to Parquet, which is quite impressive given that [fastparquet does not write decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] at all. However, the writing is *very* slow - a factor of 4 in the code snippet below. *Improvements* Of course the best outcome would be if the conversion of a decimal column can be made faster, but I am not familiar enough with pandas internals to know if that's possible. (This same behavior also applies to `.to_pickle` etc.) It would be nice if a warning were shown when object-typed columns are being converted, since that is very slow. That would at least make this behavior more explicit. Now, if fast parsing of a decimal.Decimal object column is not possible, it would be nice if a workaround were possible. For example, pass an int and then shift the dot "x" places to the left. (It is already possible to pass an int column and specify "decimal" dtype in the Arrow schema during `pa.Table.from_pandas()` but then it simply becomes a decimal without decimals.) Also, it might be nice if it could be encoded as a 128-bit byte string in the pandas column and then directly interpreted by Arrow. *Usecase* I need to save large dataframes (~10GB) of geospatial data with latitude/longitude. I can't use float as comparisons need to be exact, and the BigQuery "clustering" feature needs either an integer or a decimal but not a float. In the meantime, I have to use a workaround with only ints (the original number multiplied by 1000). 
*Snippet* {code:python} import decimal from time import time import numpy as np import pandas as pd d = dict() for col in "abcdefghijklmnopqrstuvwxyz": d[col] = np.random.rand(int(1E7)) * 100 df = pd.DataFrame(d) t0 = time() df.to_parquet("/tmp/testabc.pq", engine="pyarrow") t1 = time() df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) t2 = time() df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") t3 = time() print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s") # Saving the normal dataframe took 4.430s, with one decimal column 17.673s {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
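The description's "128-bit byte string" idea can be sketched with nothing but the standard library: round each float to a fixed scale, keep the unscaled integer, and pack it into 16 bytes, the width of an Arrow decimal128 value. Whether Arrow could consume such a column directly (and in which byte order it would expect the data) is exactly the open question in the issue; the helper names and the little-endian choice here are my assumptions:

```python
SCALE = 3  # decimal places

def pack_decimal128(value, scale=SCALE):
    # Round to `scale` places, then store the unscaled int in 16 bytes,
    # matching the 128-bit width of a decimal128 value.
    unscaled = round(value * 10 ** scale)
    return unscaled.to_bytes(16, "little", signed=True)

def unpack_decimal128(raw, scale=SCALE):
    # Recover the (unscaled value, scale) pair that a decimal type stores.
    unscaled = int.from_bytes(raw, "little", signed=True)
    return unscaled, scale

raw = pack_decimal128(12.345)
print(unpack_decimal128(raw))  # (12345, 3)
```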
[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayur Srivastava updated ARROW-8543: Description: The current coalescing algorithm is a two-pass algorithm (where N is the number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds the coalesced range to the result (out). The proposal is to convert the algorithm to a single-pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in the long term. Correction: Post sorting, the current algorithm is O(N) and the improvement is O(N); I called the current algorithm O(N^2) due to an oversight. was: The current coalescing algorithm is a O(N^2) two pass algorithm (where N is number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds coalesced range to the result (out). The proposal is to convert the algorithm to an O(N) single pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in long term. Correction: Post sorting, the current algorithm is O(N) and the improvement is O(N). 
> [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is a two-pass algorithm (where N is the number > of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds the coalesced range to the result > (out). > The proposal is to convert the algorithm to a single-pass algorithm that > computes coalesced ranges while making the first pass over the ranges in the > list. This algorithm is also shorter in lines of code and hence (hopefully) > more maintainable in the long term. > Correction: Post sorting, the current algorithm is O(N) and the improvement > is O(N); I called the current algorithm O(N^2) due to an oversight. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
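For readers following the description, the single-pass idea can be sketched as follows: walk the sorted ranges once, growing the current coalesced range, and emit it whenever the next range is too far away or the merge would exceed the size limit. This is an illustration, not the Arrow C++ implementation; the parameter names `hole_size_limit` and `range_size_limit` mirror Arrow's coalescing options, and sorted, non-overlapping input is assumed:

```python
def coalesce_ranges(ranges, hole_size_limit, range_size_limit):
    """ranges: sorted, non-overlapping (offset, length) tuples."""
    if not ranges:
        return []
    out = []
    cur_off, cur_len = ranges[0]
    for off, length in ranges[1:]:
        gap = off - (cur_off + cur_len)           # bytes between ranges
        new_len = off + length - cur_off          # size if we merge
        if gap <= hole_size_limit and new_len <= range_size_limit:
            cur_len = new_len                     # absorb into current range
        else:
            out.append((cur_off, cur_len))        # emit and start a new one
            cur_off, cur_len = off, length
    out.append((cur_off, cur_len))
    return out

print(coalesce_ranges([(0, 10), (15, 10), (100, 5)],
                      hole_size_limit=8, range_size_limit=64))
# [(0, 25), (100, 5)]
```

Each range is visited exactly once, which is the O(N) single pass the proposal describes (the sort, when needed, remains the dominant cost).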
[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088966#comment-17088966 ] Mayur Srivastava commented on ARROW-8543: - Thank you for fixing the PR! This is my first contribution, so I'm not well versed yet. Thanks, Mayur > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is an O(N^2) two-pass algorithm (where N is > the number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds the coalesced range to the result > (out). > The proposal is to convert the algorithm to an O(N) single-pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in the long term. > Correction: Post sorting, the current algorithm is O(N) and the improvement > is O(N). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayur Srivastava updated ARROW-8543: Description: The current coalescing algorithm is a O(N^2) two pass algorithm (where N is number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds coalesced range to the result (out). The proposal is to convert the algorithm to an O(N) single pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in long term. Correction: Post sorting, the current algorithm is O(N) and the improvement is O(N). was: The current coalescing algorithm is a O(N^2) two pass algorithm (where N is number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds coalesced range to the result (out). The proposal is to convert the algorithm to an O(N) single pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in long term. 
> [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is a O(N^2) two pass algorithm (where N is > number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced range to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in long term. > Correction: Post sorting, the current algorithm is O(N) and the improvement > is O(N). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088959#comment-17088959 ] Mayur Srivastava commented on ARROW-8543: - You are right. I should correct myself. The correct statement is "change the 2-pass algorithm (O(N), with a constant factor of 2) to a 1-pass algorithm (O(N), with a constant factor of 1)". What do you think? > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is an O(N^2) two-pass algorithm (where N is > the number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced ranges to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in the long term. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088949#comment-17088949 ] Antoine Pitrou commented on ARROW-8543: --- That part is O(N). You missed the {{start = next}} line. > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is a O(N^2) two pass algorithm (where N is > number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced range to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in long term. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088946#comment-17088946 ] Mayur Srivastava commented on ARROW-8543: - Hi [~apitrou], Thanks for looking into it! I agree on O(N log N) due to sorting. My comment was mainly on the post-sorting algorithm. The algorithm is as follows (let me know if I'm making a mistake) (ref: [https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.cc]):
{noformat}
start = ranges.begin(), prev = start, next = prev
while (++next != end) {
  if (isLargerThanHole(prev, next)) {
    if (next - start > 1) {
      CoalesceUntilLargeEnough(start, next, coalesced output)
    } else {
      // append start to coalesced output
    }
    start = next;
  }
  prev = next;
}
if (next - start > 1) {
  CoalesceUntilLargeEnough(start, next, coalesced output)
} else {
  // append start to coalesced output
}
{noformat}
To simplify, let's assume there is no hole in the ranges. Then, we increment 'next' till the end in the 'while' loop. This is the first pass, iterating from start to end in Coalesce(). Then we call CoalesceUntilLargeEnough(), which iterates from start to end and appends coalesced ranges to the result. This is the second pass, in CoalesceUntilLargeEnough(). This is the reason I called it a two-pass algorithm. My proposal ([https://github.com/apache/arrow/pull/7002]) is to change it to a single pass. Besides the single-pass benefits, it does less copying and is shorter in number of lines of code. Let me know what you think about it. Thanks, Mayur > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is an O(N^2) two-pass algorithm (where N is > the number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced ranges to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in the long term. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
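The single-pass idea discussed in this thread can be sketched in Python. This is a hedged illustration only, not the actual PR code: the parameter names `hole_size_limit` and `range_size_limit` are assumptions modeled on Arrow's C++ read-coalescing options, and the real implementation operates on C++ `ReadRange` structs rather than tuples.

```python
def coalesce_ranges(ranges, hole_size_limit, range_size_limit):
    """Single-pass coalescing sketch (assumed names, not Arrow's API).

    ranges: sorted, non-overlapping list of (offset, length) tuples.
    Adjacent ranges separated by a hole of at most hole_size_limit
    bytes are merged, and no coalesced range grows beyond
    range_size_limit bytes. Each input range is visited exactly once.
    """
    out = []
    if not ranges:
        return out
    cur_off, cur_len = ranges[0]
    for off, length in ranges[1:]:
        hole = off - (cur_off + cur_len)          # gap to the next range
        new_len = (off + length) - cur_off        # size if we merged it
        if hole <= hole_size_limit and new_len <= range_size_limit:
            cur_len = new_len                     # extend current range
        else:
            out.append((cur_off, cur_len))        # emit and start fresh
            cur_off, cur_len = off, length
    out.append((cur_off, cur_len))                # emit the final range
    return out
```

For example, with a 5-byte hole limit, `(0, 10)` and `(12, 10)` merge into `(0, 22)` while a range starting at offset 100 stays separate; the coalesced ranges are produced during the same loop that scans the input, which is the single-pass property the proposal describes.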
[jira] [Resolved] (ARROW-8544) [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to get around rate limiting
[ https://issues.apache.org/jira/browse/ARROW-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-8544. Resolution: Fixed Issue resolved by pull request 6994 [https://github.com/apache/arrow/pull/6994] > [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to > get around rate limiting > -- > > Key: ARROW-8544 > URL: https://issues.apache.org/jira/browse/ARROW-8544 > Project: Apache Arrow > Issue Type: Sub-task > Components: Continuous Integration, Developer Tools >Reporter: Krisztian Szucs >Assignee: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > Crossbow already queries commit statuses for generating a static github page. > Use this mechanism to reduce the required github api calls on the future > dashboard by serializing the already queried API responses. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8544) [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to get around rate limiting
Krisztian Szucs created ARROW-8544: -- Summary: [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to get around rate limiting Key: ARROW-8544 URL: https://issues.apache.org/jira/browse/ARROW-8544 Project: Apache Arrow Issue Type: Sub-task Components: Continuous Integration, Developer Tools Reporter: Krisztian Szucs Assignee: Ben Kietzman Fix For: 1.0.0 Crossbow already queries commit statuses for generating a static github page. Use this mechanism to reduce the required github api calls on the future dashboard by serializing the already queried API responses. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8543: -- Labels: pull-request-available (was: ) > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > > The current coalescing algorithm is a O(N^2) two pass algorithm (where N is > number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced range to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in long term. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088901#comment-17088901 ] Antoine Pitrou commented on ARROW-8543: --- Can you explain why the current algorithm is O(N^2)? CoalesceUntilLargeEnough() is called on disjoint subsets of the ranges, so each range is examined only once. (however, the algorithm is O(N log N) due to sorting) > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > > The current coalescing algorithm is a O(N^2) two pass algorithm (where N is > number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced range to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in long term. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8529) [C++] Fix usage of NextCounts() in GetBatchWithDict[Spaced]
[ https://issues.apache.org/jira/browse/ARROW-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-8529. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6991 [https://github.com/apache/arrow/pull/6991] > [C++] Fix usage of NextCounts() in GetBatchWithDict[Spaced] > --- > > Key: ARROW-8529 > URL: https://issues.apache.org/jira/browse/ARROW-8529 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > See discussion in ARROW-8486 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm
[ https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayur Srivastava updated ARROW-8543: Description: The current coalescing algorithm is a O(N^2) two pass algorithm (where N is number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds coalesced range to the result (out). The proposal is to convert the algorithm to an O(N) single pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in long term. was: The current coalescing algorithm is a O(n^2) two pass algorithm (where n is number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds coalesced range to the result (out). The proposal is to convert the algorithm to an O(n) single pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in long term. > [C++] IO: single pass coalescing algorithm > -- > > Key: ARROW-8543 > URL: https://issues.apache.org/jira/browse/ARROW-8543 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > > The current coalescing algorithm is a O(N^2) two pass algorithm (where N is > number of ranges) (first implemented in > https://issues.apache.org/jira/browse/ARROW-7995). 
In the first pass, the > Coalesce() function finds the begin and end of a candidate range that can be > coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes > over the ranges from begin to end and adds coalesced range to the result > (out). > The proposal is to convert the algorithm to an O(N) single pass algorithm > that computes coalesced ranges while making the first pass over the ranges in > the list. This algorithm is also shorter in lines of code and hence > (hopefully) more maintainable in long term. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8543) [C++] IO: single pass coalescing algorithm
Mayur Srivastava created ARROW-8543: --- Summary: [C++] IO: single pass coalescing algorithm Key: ARROW-8543 URL: https://issues.apache.org/jira/browse/ARROW-8543 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Mayur Srivastava The current coalescing algorithm is an O(n^2) two-pass algorithm (where n is the number of ranges) (first implemented in https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the Coalesce() function finds the begin and end of a candidate range that can be coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes over the ranges from begin to end and adds coalesced ranges to the result (out). The proposal is to convert the algorithm to an O(n) single pass algorithm that computes coalesced ranges while making the first pass over the ranges in the list. This algorithm is also shorter in lines of code and hence (hopefully) more maintainable in the long term. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-2714) [C++/Python] Variable step size slicing for arrays
[ https://issues.apache.org/jira/browse/ARROW-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2714. - Resolution: Fixed Issue resolved by pull request 6970 [https://github.com/apache/arrow/pull/6970] > [C++/Python] Variable step size slicing for arrays > -- > > Key: ARROW-2714 > URL: https://issues.apache.org/jira/browse/ARROW-2714 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Florian Jetter >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Array slicing should support variable step sizes > The current behavior raises an {{IndexError}}, e.g. > {code} > In [8]: import pyarrow as pa > In [9]: pa.array([1, 2, 3])[::-1] > --- > IndexError Traceback (most recent call last) > in () > > 1 pa.array([1, 2, 3])[::-1] > array.pxi in pyarrow.lib.Array.__getitem__() > array.pxi in pyarrow.lib._normalize_slice() > IndexError: only slices with step 1 supported > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
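The feature requested above reduces to a standard trick: any slice with an arbitrary (even negative) step can be normalized into an explicit index list and then applied as a take. The sketch below is a pure-Python illustration of that equivalence, not pyarrow's actual implementation; the helper names are hypothetical.

```python
def slice_to_indices(s, length):
    """Normalize a Python slice (any step, including negative) into the
    explicit list of indices it selects, via slice.indices()."""
    start, stop, step = s.indices(length)
    return list(range(start, stop, step))

def take(values, indices):
    # A variable-step slice is equivalent to a take with those indices.
    return [values[i] for i in indices]
```

For instance, `slice(None, None, -1)` over a length-3 sequence normalizes to indices `[2, 1, 0]`, so the reversal in the bug report becomes a plain gather of those positions.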
[jira] [Commented] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088727#comment-17088727 ] Andy Grove commented on ARROW-8536: --- [~d...@danburkert.com] I wonder if you could provide some guidance on this? cc [~paddyhoran] [~nevime] > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Critical > Fix For: 1.0.0 > > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". This is > caused by the custom build script in the arrow-flight crate, which expects to > find a "format/Flight.proto" file in a parent directory. This works when > building the crate from within the Arrow source tree, but unfortunately > doesn't work for the published crate, since the Flight.proto file was not > published as part of the crate. > The workaround is to create a "format" directory in the root of your file > system (or at least at a higher level than where cargo is building code) and > place the Flight.proto file there (making sure to use the 0.17.0 version, > which can be found in the source release [1]). > [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-8536: -- Description: When using Arrow 0.17.0 as a dependency, it is likely that you will get the error "Failed to locate format/Flight.proto in any parent directory". This is caused by the custom build script in the arrow-flight crate, which expects to find a "format/Flight.proto" file in a parent directory. This works when building the crate from within the Arrow source tree, but unfortunately doesn't work for the published crate, since the Flight.proto file was not published as part of the crate. The workaround is to create a "format" directory in the root of your file system (or at least at a higher level than where cargo is building code) and place the Flight.proto file there (making sure to use the 0.17.0 version, which can be found in the source release [1]). [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0] was: When using Arrow 0.17.0 as a dependency, it is likely that you will get the error "Failed to locate format/Flight.proto in any parent directory". This is caused by the custom build script in the arrow-flight crate, which expects to find a "format/Flight.proto" file in a parent directory. This works when building the crate from within the Arrow source tree, but unfortunately doesn't work for the published crate, since the Flight.proto file was not published as part of the crate. The workaround is to create a top-level "format" directory in your Rust project and place the Flight.proto file there (making sure to use the 0.17.0 version, which can be found in the source release [1]). 
[1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0 > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Critical > Fix For: 1.0.0 > > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". This is > caused by the custom build script in the arrow-flight crate, which expects to > find a "format/Flight.proto" file in a parent directory. This works when > building the crate from within the Arrow source tree, but unfortunately > doesn't work for the published crate, since the Flight.proto file was not > published as part of the crate. > The workaround is to create a "format" directory in the root of your file > system (or at least at a higher level than where cargo is building code) and > place the Flight.proto file there (making sure to use the 0.17.0 version, > which can be found in the source release [1]). > [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
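The failure mode and the workaround both follow from an upward parent-directory search: the build script walks from the build directory toward the filesystem root looking for format/Flight.proto, so placing the file anywhere above the build directory satisfies it. A minimal Python sketch of such a search follows; `find_in_parent_dirs` is a hypothetical name for illustration, not the arrow-flight build script itself.

```python
import os

def find_in_parent_dirs(start_dir, rel_path):
    """Walk upward from start_dir looking for rel_path (e.g.
    'format/Flight.proto'). Returns the full path if found in any
    ancestor directory, or None after reaching the filesystem root."""
    d = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(d, rel_path)
        if os.path.exists(candidate):
            return candidate
        parent = os.path.dirname(d)
        if parent == d:  # reached the filesystem root
            return None
        d = parent
```

This also explains why the workaround works from any crate: a "format" directory high enough in the tree (even at the root) is an ancestor of wherever cargo builds the code, so the upward walk eventually finds it.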
[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088678#comment-17088678 ] Antoine Pitrou commented on ARROW-8539: --- Thank you! > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Assignee: Yosuke Shiro >Priority: Critical > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088674#comment-17088674 ] Yosuke Shiro commented on ARROW-8539: - [https://github.com/Homebrew/homebrew-core/pull/53445] has been merged. CI is green. [https://github.com/apache/arrow/pull/6991/checks?check_run_id=605238698] > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Assignee: Yosuke Shiro >Priority: Critical > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro resolved ARROW-8539. - Resolution: Resolved > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Assignee: Yosuke Shiro >Priority: Critical > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-8536: - Assignee: Andy Grove > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Critical > Fix For: 1.0.0 > > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". This is > caused by the custom build script in the arrow-flight crate, which expects to > find a "format/Flight.proto" file in a parent directory. This works when > building the crate from within the Arrow source tree, but unfortunately > doesn't work for the published crate, since the Flight.proto file was not > published as part of the crate. > The workaround is to create a top-level "format" directory in your Rust > project and place the Flight.proto file there (making sure to use the 0.17.0 > version, which can be found in the source release [1]). > [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-8536: -- Description: When using Arrow 0.17.0 as a dependency, it is likely that you will get the error "Failed to locate format/Flight.proto in any parent directory". This is caused by the custom build script in the arrow-flight crate, which expects to find a "format/Flight.proto" file in a parent directory. This works when building the crate from within the Arrow source tree, but unfortunately doesn't work for the published crate, since the Flight.proto file was not published as part of the crate. The workaround is to create a top-level "format" directory in your Rust project and place the Flight.proto file there (making sure to use the 0.17.0 version, which can be found in the source release [1]). [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0 was: When using Arrow 0.17.0 as a dependency, it is likely that you will get the error "Failed to locate format/Flight.proto in any parent directory". The workaround is to create a top-level "format" directory in your Rust project and place the Flight.proto file there (making sure to use the 0.17.0 version > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Priority: Critical > Fix For: 1.0.0 > > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". This is > caused by the custom build script in the arrow-flight crate, which expects to > find a "format/Flight.proto" file in a parent directory. 
This works when > building the crate from within the Arrow source tree, but unfortunately > doesn't work for the published crate, since the Flight.proto file was not > published as part of the crate. > The workaround is to create a top-level "format" directory in your Rust > project and place the Flight.proto file there (making sure to use the 0.17.0 > version, which can be found in the source release [1]). > [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-8536: -- Description: When using Arrow 0.17.0 as a dependency, it is likely that you will get the error "Failed to locate format/Flight.proto in any parent directory". The workaround is to create a top-level "format" directory in your Rust project and place the Flight.proto file there (making sure to use the 0.17.0 version was: When using Arrow 0.17.0 as a dependency, it is likely that you will get the error "Failed to locate format/Flight.proto in any parent directory". The workaround is to create a directoy `/format` in the root of your file system and place the Flight.proto file there. > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Priority: Critical > Fix For: 1.0.0 > > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". > The workaround is to create a top-level "format" directory in your Rust > project and place the Flight.proto file there (making sure to use the 0.17.0 > version > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8065: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Untangle Dataset, Fragment and ScanOptions > - > > Key: ARROW-8065 > URL: https://issues.apache.org/jira/browse/ARROW-8065 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > > Currently: a fragment is a product of a scan; it is a lazy collection of scan > tasks corresponding to a data source which is logically singular (like a > single file, a single row group, ...). It would be more useful if instead a > fragment were the direct object of a scan; one scans a fragment (or a > collection of fragments): > # Remove {{ScanOptions}} from Fragment's properties and move it into > {{Fragment::Scan}} parameters. > # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an > overload to support predicate pushdown in FileSystemDataset and UnionDataset > {{Dataset::GetFragments(std::shared_ptr predicate)}}. > # Expose lazy accessor to Fragment::physical_schema() > # Consolidate ScanOptions and ScanContext > This will lessen the cognitive dissonance between fragments and files since > fragments will no longer include references to scan properties. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8542) [Release] Fix checksum url in the website post release script
[ https://issues.apache.org/jira/browse/ARROW-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8542: -- Labels: pull-request-available (was: ) > [Release] Fix checksum url in the website post release script > - > > Key: ARROW-8542 > URL: https://issues.apache.org/jira/browse/ARROW-8542 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The issue was captured here > https://github.com/apache/arrow-site/pull/53#discussion_r411728907 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8542) [Release] Fix checksum url in the website post release script
Krisztian Szucs created ARROW-8542: -- Summary: [Release] Fix checksum url in the website post release script Key: ARROW-8542 URL: https://issues.apache.org/jira/browse/ARROW-8542 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 The issue was captured here https://github.com/apache/arrow-site/pull/53#discussion_r411728907 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8541) [Release] Don't remove previous source releases automatically
[ https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8541: -- Labels: pull-request-available (was: ) > [Release] Don't remove previous source releases automatically > - > > Key: ARROW-8541 > URL: https://issues.apache.org/jira/browse/ARROW-8541 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We should keep at least the last three source tarballs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8541) [Release] Don't remove previous source releases automatically
[ https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088606#comment-17088606 ] Antoine Pitrou commented on ARROW-8541: --- I see all source releases here: https://archive.apache.org/dist/arrow/ > [Release] Don't remove previous source releases automatically > - > > Key: ARROW-8541 > URL: https://issues.apache.org/jira/browse/ARROW-8541 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Fix For: 1.0.0 > > > We should keep at least the last three source tarballs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8541) [Release] Don't remove previous source releases automatically
Krisztian Szucs created ARROW-8541: -- Summary: [Release] Don't remove previous source releases automatically Key: ARROW-8541 URL: https://issues.apache.org/jira/browse/ARROW-8541 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 We should keep at least the last three source tarballs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-8536: --- Fix Version/s: 1.0.0 > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Priority: Critical > Fix For: 1.0.0 > > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". > The workaround is to create a directory `/format` in the root of your file > system and place the Flight.proto file there. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-8536: --- Priority: Critical (was: Major) > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Priority: Critical > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". > The workaround is to create a directory `/format` in the root of your file > system and place the Flight.proto file there. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8540) [C++] Create memory allocation benchmark
[ https://issues.apache.org/jira/browse/ARROW-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8540: -- Labels: pull-request-available (was: ) > [C++] Create memory allocation benchmark > > > Key: ARROW-8540 > URL: https://issues.apache.org/jira/browse/ARROW-8540 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > To judge the overhead of repeated allocations and deallocations (e.g. for > temporary computation results). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8540) [C++] Create memory allocation benchmark
[ https://issues.apache.org/jira/browse/ARROW-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-8540: - Assignee: Antoine Pitrou > [C++] Create memory allocation benchmark > > > Key: ARROW-8540 > URL: https://issues.apache.org/jira/browse/ARROW-8540 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > > To judge the overhead of repeated allocations and deallocations (e.g. for > temporary computation results). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8540) [C++] Create memory allocation benchmark
[ https://issues.apache.org/jira/browse/ARROW-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8540: -- Description: To judge the overhead of repeated allocations and deallocations (e.g. for temporary computation results). (was: To judge of overhead of repeated allocations and deallocations (e.g. for temporary computation results).) > [C++] Create memory allocation benchmark > > > Key: ARROW-8540 > URL: https://issues.apache.org/jira/browse/ARROW-8540 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > > To judge the overhead of repeated allocations and deallocations (e.g. for > temporary computation results). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8540) [C++] Create memory allocation benchmark
Antoine Pitrou created ARROW-8540: - Summary: [C++] Create memory allocation benchmark Key: ARROW-8540 URL: https://issues.apache.org/jira/browse/ARROW-8540 Project: Apache Arrow Issue Type: Wish Components: C++ Reporter: Antoine Pitrou To judge of overhead of repeated allocations and deallocations (e.g. for temporary computation results). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088530#comment-17088530 ] Yosuke Shiro commented on ARROW-8539: - This is caused by llvm@8 (Homebrew). I sent https://github.com/Homebrew/homebrew-core/pull/53445. Once this PR has been merged, I'll check if CI passes. > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Priority: Critical > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro reassigned ARROW-8539: --- Assignee: Yosuke Shiro > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Assignee: Yosuke Shiro >Priority: Critical > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
[ https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088499#comment-17088499 ] Antoine Pitrou commented on ARROW-8539: --- cc [~kou] > [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails > -- > > Key: ARROW-8539 > URL: https://issues.apache.org/jira/browse/ARROW-8539 > Project: Apache Arrow > Issue Type: Bug > Components: C, Continuous Integration, GLib >Reporter: Antoine Pitrou >Priority: Critical > > See e.g. > https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 > {code} > [192/256] Generating arithmetic_ops.bc > FAILED: src/gandiva/precompiled/arithmetic_ops.bc > cd > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled > && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env > SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk > /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG > -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c > /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc > -o > /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc > -isysroot > /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk > -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src > dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib > Referenced from: /usr/local/opt/llvm@8/bin/clang-8 > Reason: image not found > Child aborted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
Antoine Pitrou created ARROW-8539: - Summary: [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails Key: ARROW-8539 URL: https://issues.apache.org/jira/browse/ARROW-8539 Project: Apache Arrow Issue Type: Bug Components: C, Continuous Integration, GLib Reporter: Antoine Pitrou See e.g. https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868 {code} [192/256] Generating arithmetic_ops.bc FAILED: src/gandiva/precompiled/arithmetic_ops.bc cd /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc -o /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc -isysroot /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib Referenced from: /usr/local/opt/llvm@8/bin/clang-8 Reason: image not found Child aborted {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8534) [C++][CSV] Issue building CSV component under GCC 6.1.0
[ https://issues.apache.org/jira/browse/ARROW-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088410#comment-17088410 ] Antoine Pitrou commented on ARROW-8534: --- Could you please open a PR? > [C++][CSV] Issue building CSV component under GCC 6.1.0 > --- > > Key: ARROW-8534 > URL: https://issues.apache.org/jira/browse/ARROW-8534 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.17.0 > Environment: Centos 7 x86_64 >Reporter: Ross Wolfson >Priority: Minor > > Hi, > In the current version (0.17.0), it seems that CSV reader.cc fails to compile > when using GCC 6.1.0. This builds when using older or newer GCC versions (we > tested with 4.8.5, 8.2.0 and 9.3.0). > > {{[root@1d4fcfc2580e arrow_src]# /ourgcc/gcc-6.1.0/bin/g++ -c > cpp/src/arrow/csv/reader.cc -I cpp/src}} > {{cpp/src/arrow/csv/reader.cc: In constructor > 'arrow::csv::SerialBlockReader::SerialBlockReader(std::unique_ptr, > arrow::Iterator >, > std::shared_ptr)':}} > {{cpp/src/arrow/csv/reader.cc:178:22: error: use of deleted function > 'std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) > [with _Tp = arrow::Chunker; _Dp = std::default_delete]'}} > {{ using BlockReader::BlockReader;}} > {{ ^~~}} > {{In file included from /ourgcc/gcc-6.1.0/include/c++/6.1.0/memory:81:0,}} > {{ from cpp/src/arrow/csv/reader.h:20,}} > {{ from cpp/src/arrow/csv/reader.cc:18:}} > {{/ourgcc/gcc-6.1.0/include/c++/6.1.0/bits/unique_ptr.h:356:7: note: declared > here}} > {{ unique_ptr(const unique_ptr&) = delete;}} > {{ ^~}} > {{cpp/src/arrow/csv/reader.cc:178:22: error: use of deleted function > 'arrow::Iterator >::Iterator(const > arrow::Iterator >&)'}} > {{ using BlockReader::BlockReader;}} > {{ ^~~}} > {{In file included from cpp/src/arrow/csv/reader.cc:43:0:}} > {{cpp/src/arrow/util/iterator.h:63:7: note: > 'arrow::Iterator >::Iterator(const > arrow::Iterator >&)' is implicitly deleted > because the default definition would be ill-formed:}} > {{ 
class Iterator : public util::EqualityComparable> {}} > {{ ^~~~}} > {{cpp/src/arrow/util/iterator.h:63:7: error: use of deleted function > 'std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) > [with _Tp = void; _Dp = void (*)(void*)]'}} > {{In file included from /ourgcc/gcc-6.1.0/include/c++/6.1.0/memory:81:0,}} > {{ from cpp/src/arrow/csv/reader.h:20,}} > {{ from cpp/src/arrow/csv/reader.cc:18:}} > {{/ourgcc/gcc-6.1.0/include/c++/6.1.0/bits/unique_ptr.h:356:7: note: declared > here}} > {{ unique_ptr(const unique_ptr&) = delete;}} > {{ ^~}} > {{cpp/src/arrow/csv/reader.cc: In member function 'virtual > arrow::Result > > arrow::csv::SerialTableReader::Read()':}} > {{cpp/src/arrow/csv/reader.cc:750:88: note: synthesized method > 'arrow::csv::SerialBlockReader::SerialBlockReader(std::unique_ptr, > arrow::Iterator >, > std::shared_ptr)' first required here}} > {{ std::move(buffer_iterator_), std::move(first_buffer));}} > {{ ^}} > {{cpp/src/arrow/csv/reader.cc: In constructor > 'arrow::csv::ThreadedBlockReader::ThreadedBlockReader(std::unique_ptr, > arrow::Iterator >, > std::shared_ptr)':}} > {{cpp/src/arrow/csv/reader.cc:221:22: error: use of deleted function > 'std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) > [with _Tp = arrow::Chunker; _Dp = std::default_delete]'}} > {{ using BlockReader::BlockReader;}} > {{ ^~~}} > {{In file included from /ourgcc/gcc-6.1.0/include/c++/6.1.0/memory:81:0,}} > {{ from cpp/src/arrow/csv/reader.h:20,}} > {{ from cpp/src/arrow/csv/reader.cc:18:}} > {{/ourgcc/gcc-6.1.0/include/c++/6.1.0/bits/unique_ptr.h:356:7: note: declared > here}} > {{ unique_ptr(const unique_ptr&) = delete;}} > {{ ^~}} > {{cpp/src/arrow/csv/reader.cc:221:22: error: use of deleted function > 'arrow::Iterator >::Iterator(const > arrow::Iterator >&)'}} > {{ using BlockReader::BlockReader;}} > {{ ^~~}} > {{cpp/src/arrow/csv/reader.cc: In member function 'virtual > arrow::Result > > arrow::csv::ThreadedTableReader::Read()':}} > 
{{cpp/src/arrow/csv/reader.cc:815:61: note: synthesized method > 'arrow::csv::ThreadedBlockReader::ThreadedBlockReader(std::unique_ptr, > arrow::Iterator >, > std::shared_ptr)' first required here}} > {{ std::move(first_buffer));}} > {{ ^}} > > My colleague found a workaround that avoids the build error, however, we are > not clear if this is the "best" fix. > {{--- a/cpp/src/arrow/csv/reader.cc}} > {{+++ b/cpp/src/arrow/csv/reader.cc}} > {{@@ -175,7 +175,12 @@ class BlockReader {}} > {{ // using CSVBlock::consume_bytes.}} > {{ class SerialBlockReader : public BlockReader {}} > {{ public:}} > {{- using BlockReader::BlockReader;}} > {{+ SerialBlockReader(std::unique_ptr chunker,}} > {{+ Iterator>
[jira] [Commented] (ARROW-6976) Possible memory leak in pyarrow read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088344#comment-17088344 ] Joris Van den Bossche commented on ARROW-6976: -- [~Athlete_369] that can be possible, depending on your file. Parquet can be highly compressed, so there can be a big difference between the file size and the size in memory for pandas. You can check the memory usage of your resulting pandas DataFrame with {{df.info(memory_usage="deep")}}. How much does that indicate? > Possible memory leak in pyarrow read_parquet > > > Key: ARROW-6976 > URL: https://issues.apache.org/jira/browse/ARROW-6976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: linux ubuntu 18.04 >Reporter: david cottrell >Priority: Critical > Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, > pyarrow_0150.png > > > > Version and repro info in the gist below. > Not sure if I'm not understanding something from this > [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] > but there seems to be memory accumulation that is exacerbated with > higher arity objects like strings and dates (not datetimes). > > I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed > to "fix" or lessen the problem. > > [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] > > Let me know if this post should go elsewhere. > !image-2019-10-23-16-17-20-739.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8537) [C++] Performance regression from ARROW-8523
[ https://issues.apache.org/jira/browse/ARROW-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088338#comment-17088338 ] Yibo Cai commented on ARROW-8537: - Added analysis: https://github.com/apache/arrow/pull/6986#issuecomment-616978252 > [C++] Performance regression from ARROW-8523 > > > Key: ARROW-8537 > URL: https://issues.apache.org/jira/browse/ARROW-8537 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yibo Cai >Priority: Major > > I optimized BitmapReader in [this > PR|https://github.com/apache/arrow/pull/6986] and saw a performance uplift in > the BitmapReader test case. I didn't check the other test cases, as the > change looked trivial. > I reviewed all test cases just now and see a big performance drop in 4 cases; > details at [PR > link|https://github.com/apache/arrow/pull/6986#issuecomment-616915079]. > I also compared the performance of code using BitmapReader and found no > obvious changes. It looks like we should revert that PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)