[jira] [Commented] (ARROW-2496) [C++] Add support for Libhdfs++

2019-11-14 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974463#comment-16974463
 ] 

Deepak Majeti commented on ARROW-2496:
--

Not anytime soon. We are still working on improving the robustness of Libhdfs++.

> [C++] Add support for Libhdfs++
> ---
>
> Key: ARROW-2496
> URL: https://issues.apache.org/jira/browse/ARROW-2496
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: HDFS
>
> Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
> project. Details are available here.
> https://issues.apache.org/jira/browse/HDFS-8707
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4967) [C++] Parquet: Object type and stats lost when using 96-bit timestamps

2019-08-21 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912811#comment-16912811
 ] 

Deepak Majeti commented on ARROW-4967:
--

The comments above are correct! The INT96 type is deprecated and its statistics 
are disabled by default. The timestamp byte layout in INT96 is big endian and 
does not comply with the standard sort orders in the spec.
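To illustrate (a pure-Python sketch, not Arrow's implementation, using the commonly described Impala-style layout of 8 bytes nanoseconds-of-day followed by 4 bytes Julian day): byte-wise comparison of INT96 values does not respect chronological order, which is one reason min/max statistics on this type cannot be trusted.

```python
import struct

JULIAN_EPOCH_DAY = 2440588  # Julian day number of the Unix epoch (1970-01-01)
NANOS_PER_DAY = 86400 * 10**9

def encode_int96(epoch_ns):
    """Pack an epoch-nanosecond timestamp into a 12-byte INT96 value
    (8-byte nanos-of-day, then 4-byte Julian day)."""
    days, nanos = divmod(epoch_ns, NANOS_PER_DAY)
    return struct.pack('<qi', nanos, days + JULIAN_EPOCH_DAY)

def decode_int96(raw):
    nanos, julian_day = struct.unpack('<qi', raw)
    return (julian_day - JULIAN_EPOCH_DAY) * NANOS_PER_DAY + nanos

ts = 1_553_822_999_123_456_789  # an arbitrary epoch-nanosecond value
assert decode_int96(encode_int96(ts)) == ts

# Byte-wise (lexicographic) comparison disagrees with numeric order, so a
# naive byte-level min/max over INT96 values is meaningless:
assert encode_int96(256) < encode_int96(1)
```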

> [C++] Parquet: Object type and stats lost when using 96-bit timestamps
> --
>
> Key: ARROW-4967
> URL: https://issues.apache.org/jira/browse/ARROW-4967
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
>Reporter: Diego Argueta
>Priority: Minor
>  Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes 
> an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema 
> 
> foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4 
> 
> foo:  INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 
> 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, it appears that that metadata is lost. 
> No object type, and no column stats.
> {code}
> file schema: schema 
> 
> foo: OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4 
> 
> foo:  INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing since the metadata for the exact same data can look 
> different depending on an unrelated flag being set or cleared.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6180) [C++] Create InputStream that references a segment of a RandomAccessFile

2019-08-13 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906653#comment-16906653
 ] 

Deepak Majeti commented on ARROW-6180:
--

[~wesmckinn], [~pitrou] it looks like the issue can also happen in a 
single-threaded application when you read multiple column chunks in a 
row-by-row fashion from each chunk. You end up creating a BufferedInputStream 
for each chunk; all these InputStreams share the same RandomAccessFile and 
thereby clobber each other's file position.
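The clobbering scenario, and the fix being discussed, can be sketched in Python (hypothetical names; Arrow's actual classes are C++): give each stream its own offset and use positional reads, so streams sharing one file descriptor never disturb a shared file position.

```python
import os
import tempfile

class SegmentInputStream:
    """Hypothetical sketch of an InputStream over [offset, offset + length)
    of a shared file descriptor. os.pread never moves the descriptor's seek
    position, so many streams can share one descriptor without interfering."""
    def __init__(self, fd, offset, length):
        self.fd, self.pos, self.end = fd, offset, offset + length

    def read(self, nbytes):
        nbytes = min(nbytes, self.end - self.pos)
        data = os.pread(self.fd, nbytes, self.pos)  # positional read
        self.pos += len(data)
        return data

with tempfile.TemporaryFile() as f:
    f.write(b'column-one|column-two|')
    f.flush()  # make the bytes visible to pread on the raw descriptor
    fd = f.fileno()
    a = SegmentInputStream(fd, 0, 10)
    b = SegmentInputStream(fd, 11, 10)
    # Interleaved reads from the two segments do not interfere:
    assert a.read(6) == b'column'
    assert b.read(6) == b'column'
    assert a.read(4) == b'-one'
    assert b.read(4) == b'-two'
```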

> [C++] Create InputStream that references a segment of a RandomAccessFile
> 
>
> Key: ARROW-6180
> URL: https://issues.apache.org/jira/browse/ARROW-6180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>
> If different threads want to do buffered reads over different portions of a 
> file (and they are unable to create their own separate file handles), they 
> may clobber each other. I would propose creating an object that keeps the 
> RandomAccessFile internally and implements the InputStream API in a way that 
> is safe from other threads changing the file position.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5524) [C++] Turn off PARQUET_BUILD_ENCRYPTION in CMake if OpenSSL not found

2019-06-16 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved ARROW-5524.
--
   Resolution: Fixed
Fix Version/s: 0.14.0

> [C++] Turn off PARQUET_BUILD_ENCRYPTION in CMake if OpenSSL not found
> -
>
> Key: ARROW-5524
> URL: https://issues.apache.org/jira/browse/ARROW-5524
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5538) [C++] Restrict minimum OpenSSL version to 1.0.2

2019-06-11 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861702#comment-16861702
 ] 

Deepak Majeti commented on ARROW-5538:
--

API compatibility is another motivation for this as well. OpenSSL 1.0.2 onwards 
provides the calls used in the current implementation.
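For reference (a sketch, not Arrow's build logic): OpenSSL 1.x encodes its version in `OPENSSL_VERSION_NUMBER` as the hex nibbles 0xMNNFFPPS (major, minor, fix, patch letter, release status), so a minimum-version gate is a single integer comparison.

```python
def openssl_version_at_least(version_number, major, minor, fix):
    """Check an OpenSSL 1.x OPENSSL_VERSION_NUMBER (0xMNNFFPPS layout)
    against a required floor of major.minor.fix."""
    floor = (major << 28) | (minor << 20) | (fix << 12)
    return version_number >= floor

assert openssl_version_at_least(0x1000204F, 1, 0, 2)      # 1.0.2d release
assert not openssl_version_at_least(0x1000114F, 1, 0, 2)  # 1.0.1t release
```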

> [C++] Restrict minimum OpenSSL version to 1.0.2
> ---
>
> Key: ARROW-5538
> URL: https://issues.apache.org/jira/browse/ARROW-5538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>
> We must enable encryption support in Arrow only if the OpenSSL version is at 
> least 1.0.2. The official documentation prohibits using older versions.
> [https://www.openssl.org/source/]





[jira] [Created] (ARROW-5538) [C++] Restrict minimum OpenSSL version to 1.0.2

2019-06-10 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-5538:


 Summary: [C++] Restrict minimum OpenSSL version to 1.0.2
 Key: ARROW-5538
 URL: https://issues.apache.org/jira/browse/ARROW-5538
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Deepak Majeti
Assignee: Deepak Majeti


We must enable encryption support in Arrow only if the OpenSSL version is at 
least 1.0.2. The official documentation prohibits using older versions.

[https://www.openssl.org/source/]





[jira] [Created] (ARROW-5241) [Python] Add option to disable writing statistics

2019-04-29 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-5241:


 Summary: [Python] Add option to disable writing statistics
 Key: ARROW-5241
 URL: https://issues.apache.org/jira/browse/ARROW-5241
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Deepak Majeti
 Fix For: 0.14.0


The C++ Parquet API exposes an option to disable writing statistics when 
writing a Parquet file.
It would be useful to expose this option in the Python Arrow API as well.





[jira] [Commented] (ARROW-4139) [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829738#comment-16829738
 ] 

Deepak Majeti commented on ARROW-4139:
--

The statistics are fixed for UTF-8 types:
https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/types.cc#L260
The comment at
https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L140
is out of date and must be fixed; see the comment here:
https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L558
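The cast being requested can be sketched in Python (illustrative only, with a hypothetical helper name — not pyarrow's code): min/max statistics arrive as raw bytes, and should be decoded to unicode only when the column's ConvertedType is UTF8.

```python
def logical_min_max(raw_min, raw_max, converted_type):
    """Hypothetical helper: present BYTE_ARRAY statistics as str when the
    column is declared UTF8; leave other byte columns untouched."""
    if converted_type == 'UTF8':
        return raw_min.decode('utf-8'), raw_max.decode('utf-8')
    return raw_min, raw_max

assert logical_min_max(b'a', b'z', 'UTF8') == ('a', 'z')
assert logical_min_max(b'\x00', b'\xff', 'NONE') == (b'\x00', b'\xff')
```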

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is 
> set
> ---
>
> Key: ARROW-4139
> URL: https://issues.apache.org/jira/browse/ARROW-4139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Matthew Rocklin
>Priority: Minor
>  Labels: parquet, pull-request-available, python
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When writing Pandas data to Parquet format and reading it back again, I find 
> that the statistics of text columns are stored as byte arrays rather than as 
> unicode text. 
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding 
> of how best to manage statistics.  (I'd be quite happy to learn that it was 
> the latter).
> Here is a minimal example
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> rg = piece.row_group(0)
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
> >>> c
> 
>   file_offset: 63
>   file_path: 
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
> 
>   has_min_max: True
>   min: b'a'
>   max: b'a'
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like 
> UNICODE, though I don't have enough experience with Parquet data types to 
> know if this is a good idea or possible.





[jira] [Created] (ARROW-5218) [C++] Improve build when third-party library locations are specified

2019-04-25 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-5218:


 Summary: [C++] Improve build when third-party library locations 
are specified 
 Key: ARROW-5218
 URL: https://issues.apache.org/jira/browse/ARROW-5218
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Deepak Majeti
Assignee: Deepak Majeti


The current CMake build system does not handle user specified third-party 
library locations well.





[jira] [Comment Edited] (ARROW-5008) [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist

2019-03-28 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804505#comment-16804505
 ] 

Deepak Majeti edited comment on ARROW-5008 at 3/29/19 1:29 AM:
---

However, installing tzdata is recommended since timezone information is 
constantly updated.


was (Author: mdeepak):
However, installing tzdata is the recommended since timezone information is 
constantly updated.

> [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist
> 
>
> Key: ARROW-5008
> URL: https://issues.apache.org/jira/browse/ARROW-5008
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.12.1
>Reporter: Keith Kraus
>Priority: Major
>
> In docker containers it's common for `/etc/localtime` to not exist, and if it 
> doesn't exist it causes a file not found error which is not handled in 
> PyArrow. Workaround is to install `tzdata` into the container (at least for 
> Ubuntu), but wanted to report upstream.





[jira] [Commented] (ARROW-5008) [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist

2019-03-28 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804505#comment-16804505
 ] 

Deepak Majeti commented on ARROW-5008:
--

However, installing tzdata is recommended since timezone information is 
constantly updated.

> [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist
> 
>
> Key: ARROW-5008
> URL: https://issues.apache.org/jira/browse/ARROW-5008
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.12.1
>Reporter: Keith Kraus
>Priority: Major
>
> In docker containers it's common for `/etc/localtime` to not exist, and if it 
> doesn't exist it causes a file not found error which is not handled in 
> PyArrow. Workaround is to install `tzdata` into the container (at least for 
> Ubuntu), but wanted to report upstream.





[jira] [Commented] (ARROW-5008) [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist

2019-03-28 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804503#comment-16804503
 ] 

Deepak Majeti commented on ARROW-5008:
--

The ORC library does depend on the presence of '/etc/localtime'.
Another workaround for Ubuntu could be something like
{noformat}
ln -fs /usr/share/zoneinfo/Etc/UTC /etc/localtime
{noformat}

ORC-218 aims to handle such systems where there are no timezone files.
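The fallback idea can be sketched like this (hypothetical helper, not the ORC library's code): prefer {{/etc/localtime}} when it exists, otherwise fall back to a bundled zone file such as UTC instead of crashing.

```python
import os

def resolve_localtime(path_exists=os.path.exists,
                      fallback='/usr/share/zoneinfo/Etc/UTC'):
    """Return a timezone file to use: /etc/localtime when present,
    otherwise a fallback path (default shown is hypothetical)."""
    return '/etc/localtime' if path_exists('/etc/localtime') else fallback

# Injecting the existence check makes both branches easy to demonstrate:
assert resolve_localtime(lambda p: True) == '/etc/localtime'
assert resolve_localtime(lambda p: False) == '/usr/share/zoneinfo/Etc/UTC'
```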

> [Python] ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist
> 
>
> Key: ARROW-5008
> URL: https://issues.apache.org/jira/browse/ARROW-5008
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.12.1
>Reporter: Keith Kraus
>Priority: Major
>
> In docker containers it's common for `/etc/localtime` to not exist, and if it 
> doesn't exist it causes a file not found error which is not handled in 
> PyArrow. Workaround is to install `tzdata` into the container (at least for 
> Ubuntu), but wanted to report upstream.





[jira] [Assigned] (ARROW-4302) [C++] Add OpenSSL to C++ build toolchain

2019-01-23 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned ARROW-4302:


Assignee: Deepak Majeti

> [C++] Add OpenSSL to C++ build toolchain
> 
>
> Key: ARROW-4302
> URL: https://issues.apache.org/jira/browse/ARROW-4302
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> This is needed for encryption support for Parquet, among other things.





[jira] [Commented] (ARROW-4302) [C++] Add OpenSSL to C++ build toolchain

2019-01-23 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750176#comment-16750176
 ] 

Deepak Majeti commented on ARROW-4302:
--

[~thamha], [~James C] and I can help with the OpenSSL part. We have some 
experience with this from the Libhdfs++ project. Do you want to share your code 
first in a PR? We can then add support for OpenSSL.

> [C++] Add OpenSSL to C++ build toolchain
> 
>
> Key: ARROW-4302
> URL: https://issues.apache.org/jira/browse/ARROW-4302
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> This is needed for encryption support for Parquet, among other things.





[jira] [Commented] (ARROW-4293) [C++] Can't access parquet statistics on binary columns

2019-01-18 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746555#comment-16746555
 ] 

Deepak Majeti commented on ARROW-4293:
--

This should be a Parquet JIRA. [~wesmckinn] Can we move this Jira to the 
Parquet project?

{{HasCorrectStatistics()}} has to be updated to accept all statistics written 
by parquet-mr 1.10.0.

parquet-mr implemented the new fixed min-max statistics in the following Jira 
that went into the 1.10.0 release

https://issues.apache.org/jira/browse/PARQUET-1025
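The updated gate being described might look like this simplified sketch (hypothetical names; parquet-cpp's real {{HasCorrectStatistics()}} checks more conditions): trust unsigned-sort-order statistics only when the writer is parquet-mr at or above 1.10.0, where the PARQUET-1025 fix landed.

```python
import re

def parse_created_by(created_by):
    """Parse a Parquet created_by string such as
    'parquet-mr version 1.10.0 (build abc123)' into (app, (major, minor, patch))."""
    m = re.match(r'(\S+) version (\d+)\.(\d+)\.(\d+)', created_by)
    if not m:
        return None, (0, 0, 0)
    return m.group(1), tuple(int(g) for g in m.groups()[1:])

def stats_trustworthy(created_by, sort_order_signed):
    """Sketch of a version gate for min/max statistics."""
    app, version = parse_created_by(created_by)
    if sort_order_signed:
        return True  # signed sort order was always written correctly
    # Fixed min-max statistics for other sort orders landed in parquet-mr 1.10.0
    return app == 'parquet-mr' and version >= (1, 10, 0)

assert stats_trustworthy('parquet-mr version 1.10.0 (build abc)', False)
assert not stats_trustworthy('parquet-mr version 1.8.2 (build abc)', False)
assert stats_trustworthy('parquet-cpp version 1.4.0', True)
```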

> [C++] Can't access parquet statistics on binary columns
> ---
>
> Key: ARROW-4293
> URL: https://issues.apache.org/jira/browse/ARROW-4293
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Ildar
>Priority: Major
>
> Hi,
> I'm trying to use per-column statistics (min/max values) to filter out row 
> groups while reading a parquet file. But I don't see statistics built for 
> binary columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}} 
> discards statistics that have sort order {{UNSIGNED}} and haven't been 
> created by parquet-cpp. As I understand it, there used to be some issues in 
> {{parquet-mr}} before. But do they still persist?
> For example, I have parquet file created with {{parquet-mr}} version 1.10, it 
> seems to have correct min/max values for binary columns. And {{parquet-cpp}} 
> works fine for me if I remove this code from {{HasCorrectStatistics()}} func:
>  
> {code:java}
> if (SortOrder::SIGNED != sort_order && !max_equals_min) {
>     return false;
> }{code}
>  





[jira] [Comment Edited] (ARROW-2496) [C++] Add support for Libhdfs++

2018-04-23 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448207#comment-16448207
 ] 

Deepak Majeti edited comment on ARROW-2496 at 4/23/18 2:19 PM:
---

CC [~James C], who is one of the main developers of Libhdfs++


was (Author: mdeepak):
CC [~James C]

> [C++] Add support for Libhdfs++
> ---
>
> Key: ARROW-2496
> URL: https://issues.apache.org/jira/browse/ARROW-2496
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>
> Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
> project. Details are available here.
> https://issues.apache.org/jira/browse/HDFS-8707
>  
>  





[jira] [Commented] (ARROW-2496) [C++] Add support for Libhdfs++

2018-04-23 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448207#comment-16448207
 ] 

Deepak Majeti commented on ARROW-2496:
--

CC [~James C]

> [C++] Add support for Libhdfs++
> ---
>
> Key: ARROW-2496
> URL: https://issues.apache.org/jira/browse/ARROW-2496
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>
> Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
> project. Details are available here.
> https://issues.apache.org/jira/browse/HDFS-8707
>  
>  





[jira] [Created] (ARROW-2496) [C++] Add support for Libhdfs++

2018-04-23 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-2496:


 Summary: [C++] Add support for Libhdfs++
 Key: ARROW-2496
 URL: https://issues.apache.org/jira/browse/ARROW-2496
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Deepak Majeti
Assignee: Deepak Majeti


Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
project. Details are available here.

https://issues.apache.org/jira/browse/HDFS-8707

 

 





[jira] [Commented] (ARROW-2247) [Python] Statically-linking boost_regex in both libarrow and libparquet results in segfault

2018-04-06 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428528#comment-16428528
 ] 

Deepak Majeti commented on ARROW-2247:
--

https://issues.apache.org/jira/browse/PARQUET-1265 probably is a fix for this 
issue.

> [Python] Statically-linking boost_regex in both libarrow and libparquet 
> results in segfault
> ---
>
> Key: ARROW-2247
> URL: https://issues.apache.org/jira/browse/ARROW-2247
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Priority: Major
>
> This is a backtrace loading {{libparquet.so}} on Ubuntu 14.04 using boost 
> 1.66.1 from conda-forge. Both libarrow and libparquet contain {{boost_regex}} 
> statically linked. 
> {code}
> In [1]: import ctypes
> In [2]: ctypes.CDLL('libparquet.so')
> Program received signal SIGSEGV, Segmentation fault.
> 0x7fffed4ad3fb in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> (gdb) bt
> #0  0x7fffed4ad3fb in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #1  0x7fffed74c1fc in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() ()
> from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #2  0x7fffed794803 in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #3  0x7fffed79e62b in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #4  0x7fffee58561b in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff3780, 
> p1=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
> p2=0x7fffee60064a "", f=0) at 
> /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x7fffee5855a7 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x7fffee5683f3 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x7fffee5656d0 in parquet::ApplicationVersion::ApplicationVersion (
> Python Exception <class 'gdb.error'> There is no member named _M_dataplus.: 
> this=0x7fffee8f1fb8 
> , created_by=)
> at ../src/parquet/metadata.cc:452
> #8  0x7fffee41c271 in __cxx_global_var_init.1(void) () at 
> ../src/parquet/metadata.cc:35
> #9  0x7fffee41c44e in _GLOBAL__sub_I_metadata.tmp.wesm_desktop.4838.ii ()
> from /home/wesm/local/lib/libparquet.so
> #10 0x77dea1da in call_init (l=<optimized out>, argc=argc@entry=2, 
> argv=argv@entry=0x7fff5d88, 
> env=env@entry=0x7fff5da0) at dl-init.c:78
> #11 0x77dea2c3 in call_init (env=<optimized out>, argv=<optimized out>, 
> argc=<optimized out>, l=<optimized out>) at dl-init.c:36
> #12 _dl_init (main_map=main_map@entry=0x13fb220, argc=2, argv=0x7fff5d88, 
> env=0x7fff5da0)
> at dl-init.c:126
> {code}
> This seems to be caused by static initializations in libparquet:
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L34
> We should see if removing these static initializations makes the problem go 
> away. If not, then statically-linking boost_regex in both libraries is not 
> advisable.
> For this reason and more, I really wish that Arrow and Parquet shared a 
> common build system and monorepo structure -- it would make handling these 
> toolchain and build-related issues much simpler. 





[jira] [Commented] (ARROW-2247) [Python] Statically-linking boost_regex in both libarrow and libparquet results in segfault

2018-03-05 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386198#comment-16386198
 ] 

Deepak Majeti commented on ARROW-2247:
--

If the proposed solution solves linking issues such as this, we should go 
forward with it.

> [Python] Statically-linking boost_regex in both libarrow and libparquet 
> results in segfault
> ---
>
> Key: ARROW-2247
> URL: https://issues.apache.org/jira/browse/ARROW-2247
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Priority: Major
>
> This is a backtrace loading {{libparquet.so}} on Ubuntu 14.04 using boost 
> 1.66.1 from conda-forge. Both libarrow and libparquet contain {{boost_regex}} 
> statically linked. 
> {code}
> In [1]: import ctypes
> In [2]: ctypes.CDLL('libparquet.so')
> Program received signal SIGSEGV, Segmentation fault.
> 0x7fffed4ad3fb in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> (gdb) bt
> #0  0x7fffed4ad3fb in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #1  0x7fffed74c1fc in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() ()
> from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #2  0x7fffed794803 in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #3  0x7fffed79e62b in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #4  0x7fffee58561b in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff3780, 
> p1=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
> p2=0x7fffee60064a "", f=0) at 
> /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x7fffee5855a7 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x7fffee5683f3 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x7fffee5656d0 in parquet::ApplicationVersion::ApplicationVersion (
> Python Exception <class 'gdb.error'> There is no member named _M_dataplus.: 
> this=0x7fffee8f1fb8 
> , created_by=)
> at ../src/parquet/metadata.cc:452
> #8  0x7fffee41c271 in __cxx_global_var_init.1(void) () at 
> ../src/parquet/metadata.cc:35
> #9  0x7fffee41c44e in _GLOBAL__sub_I_metadata.tmp.wesm_desktop.4838.ii ()
> from /home/wesm/local/lib/libparquet.so
> #10 0x77dea1da in call_init (l=<optimized out>, argc=argc@entry=2, 
> argv=argv@entry=0x7fff5d88, 
> env=env@entry=0x7fff5da0) at dl-init.c:78
> #11 0x77dea2c3 in call_init (env=<optimized out>, argv=<optimized out>, 
> argc=<optimized out>, l=<optimized out>) at dl-init.c:36
> #12 _dl_init (main_map=main_map@entry=0x13fb220, argc=2, argv=0x7fff5d88, 
> env=0x7fff5da0)
> at dl-init.c:126
> {code}
> This seems to be caused by static initializations in libparquet:
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L34
> We should see if removing these static initializations makes the problem go 
> away. If not, then statically-linking boost_regex in both libraries is not 
> advisable.
> For this reason and more, I really wish that Arrow and Parquet shared a 
> common build system and monorepo structure -- it would make handling these 
> toolchain and build-related issues much simpler. 





[jira] [Commented] (ARROW-2247) [Python] Statically-linking boost_regex in both libarrow and libparquet results in segfault

2018-03-03 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384877#comment-16384877
 ] 

Deepak Majeti commented on ARROW-2247:
--

Interesting that the same boost version is causing an issue. Could it be an 
issue with the Python CDLL call?

Having a mono repo for Arrow and Parquet definitely simplifies the build 
toolchain. But this particular problem can only be solved if we have a single 
library consisting of both Arrow and Parquet, correct?

> [Python] Statically-linking boost_regex in both libarrow and libparquet 
> results in segfault
> ---
>
> Key: ARROW-2247
> URL: https://issues.apache.org/jira/browse/ARROW-2247
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Priority: Major
>
> This is a backtrace loading {{libparquet.so}} on Ubuntu 14.04 using boost 
> 1.66.1 from conda-forge. Both libarrow and libparquet contain {{boost_regex}} 
> statically linked. 
> {code}
> In [1]: import ctypes
> In [2]: ctypes.CDLL('libparquet.so')
> Program received signal SIGSEGV, Segmentation fault.
> 0x7fffed4ad3fb in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> (gdb) bt
> #0  0x7fffed4ad3fb in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #1  0x7fffed74c1fc in 
> boost::re_detail_106600::cpp_regex_traits_char_layer<char>::init() ()
> from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #2  0x7fffed794803 in 
> boost::object_cache<boost::re_detail_106600::cpp_regex_traits_base<char>, 
> boost::re_detail_106600::cpp_regex_traits_implementation<char> 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base<char> const&, 
> unsigned long) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #3  0x7fffed79e62b in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::do_assign(char const*, char const*, 
> unsigned int) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #4  0x7fffee58561b in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff3780, 
> p1=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
> p2=0x7fffee60064a "", f=0) at 
> /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x7fffee5855a7 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::assign (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x7fffee5683f3 in boost::basic_regex<char, boost::regex_traits<char, 
> boost::cpp_regex_traits<char> > >::basic_regex (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x7fffee5656d0 in parquet::ApplicationVersion::ApplicationVersion (
> Python Exception <class 'gdb.error'> There is no member named _M_dataplus.: 
> this=0x7fffee8f1fb8 
> , created_by=)
> at ../src/parquet/metadata.cc:452
> #8  0x7fffee41c271 in __cxx_global_var_init.1(void) () at 
> ../src/parquet/metadata.cc:35
> #9  0x7fffee41c44e in _GLOBAL__sub_I_metadata.tmp.wesm_desktop.4838.ii ()
> from /home/wesm/local/lib/libparquet.so
> #10 0x77dea1da in call_init (l=<optimized out>, argc=argc@entry=2, 
> argv=argv@entry=0x7fff5d88, 
> env=env@entry=0x7fff5da0) at dl-init.c:78
> #11 0x77dea2c3 in call_init (env=<optimized out>, argv=<optimized out>, 
> argc=<optimized out>, l=<optimized out>) at dl-init.c:36
> #12 _dl_init (main_map=main_map@entry=0x13fb220, argc=2, argv=0x7fff5d88, 
> env=0x7fff5da0)
> at dl-init.c:126
> {code}
> This seems to be caused by static initializations in libparquet:
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L34
> We should see if removing these static initializations makes the problem go 
> away. If not, then statically-linking boost_regex in both libraries is not 
> advisable.
> For this reason and more, I really wish that Arrow and Parquet shared a 
> common build system and monorepo structure -- it would make handling these 
> toolchain and build-related issues much simpler. 
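A minimal sketch of the fix being discussed above: instead of constructing a regex in a namespace-scope static (which runs during shared-library load, as in frames #8/#9 of the backtrace), defer construction behind a function-local static, which is initialized on first use and is thread-safe since C++11. This illustration uses std::regex with a simplified pattern and hypothetical names (CreatedByRegex, MatchesCreatedBy) for self-containment, not the actual parquet-cpp code.

```cpp
#include <regex>
#include <string>

// Returns a lazily constructed regex. The function-local static is built
// the first time this function is called, not during dynamic initialization
// of the shared library, avoiding static-initialization-order problems.
const std::regex& CreatedByRegex() {
  // Simplified "created_by" version-string pattern, for illustration only;
  // the real parquet-cpp pattern is the one shown in the backtrace above.
  static const std::regex kRegex(
      "(.*?)\\s*version\\s*([^(]*?)(?:\\s*\\(\\s*build\\s*([^)]*?)\\s*\\))?");
  return kRegex;
}

// Example consumer: check whether an application version string parses.
bool MatchesCreatedBy(const std::string& created_by) {
  return std::regex_match(created_by, CreatedByRegex());
}
```

With this shape, a translation unit that never touches the regex pays no construction cost, and load-time crashes inside the regex library (like the one captured above) cannot occur.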



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1536) [C++] Do not transitively depend on libboost_system

2017-09-16 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169176#comment-16169176
 ] 

Deepak Majeti commented on ARROW-1536:
--

[~wesmckinn] I made an attempt here https://github.com/apache/arrow/pull/1105. 
Please let me know what you think. Thanks!

> [C++] Do not transitively depend on libboost_system
> ---
>
> Key: ARROW-1536
> URL: https://issues.apache.org/jira/browse/ARROW-1536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.0
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We picked up this dependency recently. I don't think this is a blocker for 
> 0.7.0, but it impacts static linkers (e.g. linkers of parquet-cpp)
> This was introduced in ARROW-1339 
> https://github.com/apache/arrow/commit/94b7cfafae0bda8f68ee3e5e9702c954b0116203
> cc [~mdeepak]





[jira] [Commented] (ARROW-1186) [C++] Enable option to build arrow with minimal dependencies needed to build Parquet library

2017-07-06 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077009#comment-16077009
 ] 

Deepak Majeti commented on ARROW-1186:
--

PR: https://github.com/apache/arrow/pull/815

> [C++] Enable option to build arrow with minimal dependencies needed to build 
> Parquet library
> 
>
> Key: ARROW-1186
> URL: https://issues.apache.org/jira/browse/ARROW-1186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Deepak Majeti
>
> As discussed in https://github.com/apache/parquet-cpp/pull/366, the scope of 
> this JIRA is to add a build option to Arrow that enables only the minimal 
> transitive dependencies required for Parquet.





[jira] [Created] (ARROW-1186) [C++] Enable option to build arrow with minimal dependencies needed to build Parquet library

2017-07-05 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-1186:


 Summary: [C++] Enable option to build arrow with minimal 
dependencies needed to build Parquet library
 Key: ARROW-1186
 URL: https://issues.apache.org/jira/browse/ARROW-1186
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Deepak Majeti


As discussed in https://github.com/apache/parquet-cpp/pull/366, the scope of 
this JIRA is to add a build option to Arrow that enables only the minimal 
transitive dependencies required for Parquet.





[jira] [Commented] (ARROW-820) [C++] Build dependencies for Parquet library without arrow support

2017-04-13 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968304#comment-15968304
 ] 

Deepak Majeti commented on ARROW-820:
-

Will post a PR shortly.

> [C++] Build dependencies for Parquet library without arrow support
> --
>
> Key: ARROW-820
> URL: https://issues.apache.org/jira/browse/ARROW-820
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>
> The Parquet C++ library without Arrow depends only on a subset of Arrow 
> components (buffers, I/O). The scope of this JIRA is to build libarrow with 
> minimal dependencies for users of the Parquet C++ library without Arrow support.





[jira] [Created] (ARROW-820) [C++] Build dependencies for Parquet library without arrow support

2017-04-13 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-820:
---

 Summary: [C++] Build dependencies for Parquet library without 
arrow support
 Key: ARROW-820
 URL: https://issues.apache.org/jira/browse/ARROW-820
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Deepak Majeti


The Parquet C++ library without Arrow depends only on a subset of Arrow 
components (buffers, I/O). The scope of this JIRA is to build libarrow with 
minimal dependencies for users of the Parquet C++ library without Arrow support.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)