[jira] [Updated] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

2019-08-01 Thread Naga (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naga updated ARROW-6114:

Description: 
h3. Datatypes are not preserved when a pandas data frame is *partitioned* and 
saved as a parquet file using pyarrow, but that is not the case when the data 
frame is not partitioned.

*Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
{code:python}
# Saving a pandas DataFrame to local disk as a partitioned parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols,
                    preserve_index=False)

# Loading the partitioned parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
name object
age category
dtype: object
{code}
h5. {color:#d04437}From the above output, we can see that the data type of 
age is int64 in the original pandas data frame, but it changed to category 
after the dataset was saved to local disk and loaded back.{color}
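
A possible workaround until this is fixed (a minimal sketch, assuming the 
partition column is known to hold integers; depending on the pyarrow version 
the restored categories may be strings, hence the cast through {{str}}):
{code:python}
import pyarrow.parquet as pq

# Re-read the partitioned dataset written above; 'age' comes back as category
df = pq.ParquetDataset('test').read_pandas().to_pandas()

# Cast the partition column back to its original dtype
df['age'] = df['age'].astype(str).astype('int64')
print(df.dtypes)  # age is int64 again
{code}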

*Case 2: Non-partitioned dataset - Data types are preserved*
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Loading the non-partitioned parquet file from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using 
pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object

Datatypes after loading the dataset
age int64
name object
dtype: object
{code}
*Versions*
 * Python 3.7.3
 * pyarrow 0.14.1


> Datatypes are not preserved when a pandas dataframe partitioned and saved as 
> parquet file using pyarrow
> ---
>
> Key: ARROW-6114
> URL: https://issues.apache.org/jira/browse/ARROW-6114
> Project: Apache Arrow
>  Issue Type: Bug
>  

[jira] [Created] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

2019-08-01 Thread Naga (JIRA)
Naga created ARROW-6114:
---

 Summary: Datatypes are not preserved when a pandas dataframe 
partitioned and saved as parquet file using pyarrow
 Key: ARROW-6114
 URL: https://issues.apache.org/jira/browse/ARROW-6114
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Python 3.7.3
pyarrow 0.14.1
Reporter: Naga




[jira] [Updated] (ARROW-6113) [Java] Support vector deduplicate function

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6113:
--
Labels: pull-request-available  (was: )

> [Java] Support vector deduplicate function
> --
>
> Key: ARROW-6113
> URL: https://issues.apache.org/jira/browse/ARROW-6113
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> Remove adjacent duplicated elements from a vector. This function can be 
> used, for example, in finding distinct values, or in compressing the vector 
> data.
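
For illustration only, a minimal Python sketch of the adjacent-deduplication 
idea described above (the actual feature targets the Java vector classes, so 
the function name here is hypothetical):
{code:python}
def dedup_adjacent(values):
    # Keep each element only if it differs from its immediate predecessor.
    # On sorted input this yields the distinct values; on unsorted input it
    # compresses runs of repeated values.
    out = []
    for v in values:
        if not out or out[-1] != v:
            out.append(v)
    return out

print(dedup_adjacent([1, 1, 2, 2, 2, 3, 1]))  # [1, 2, 3, 1]
{code}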





[jira] [Created] (ARROW-6113) [Java] Support vector deduplicate function

2019-08-01 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6113:
---

 Summary: [Java] Support vector deduplicate function
 Key: ARROW-6113
 URL: https://issues.apache.org/jira/browse/ARROW-6113
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Remove adjacent duplicated elements from a vector. This function can be used, 
for example, in finding distinct values, or in compressing the vector data.





[jira] [Comment Edited] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Ji Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898536#comment-16898536
 ] 

Ji Liu edited comment on ARROW-6111 at 8/2/19 4:02 AM:
---

Sure, feel free to take this yourself; I just saw this was unassigned. :)
Also, I would like to help if needed.


was (Author: tianchen92):
Sure, feel free to take this yourself; I just saw this was unassigned. :)

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
> Fix For: 0.15.0
>
>






[jira] [Commented] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Ji Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898536#comment-16898536
 ] 

Ji Liu commented on ARROW-6111:
---

Sure, feel free to take this yourself; I just saw this was unassigned. :)

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
> Fix For: 0.15.0
>
>






[jira] [Commented] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898533#comment-16898533
 ] 

Micah Kornfield commented on ARROW-6111:


[~tianchen92] Please hold off on this until ARROW-6112 is either approved or we 
decide we don't want to do it. I was thinking of doing this myself, since I've 
already done most of the work for LargeList and the changes are similar, but we 
can figure out the division of work after the details of ARROW-6112 are worked 
out. ARROW-750 has the details on the new types.

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
> Fix For: 0.15.0
>
>






[jira] [Assigned] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-6110:
--

Assignee: Micah Kornfield

> [Java] Support LargeList Type and add integration test with C++
> ---
>
> Key: ARROW-6110
> URL: https://issues.apache.org/jira/browse/ARROW-6110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>






[jira] [Commented] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Ji Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898527#comment-16898527
 ] 

Ji Liu commented on ARROW-6111:
---

[~emkornfi...@gmail.com] Any more description? 

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
> Fix For: 0.15.0
>
>






[jira] [Assigned] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Ji Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu reassigned ARROW-6111:
-

Assignee: Ji Liu

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
> Fix For: 0.15.0
>
>






[jira] [Created] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6112:
--

 Summary: [Java] Update APIs to support 64-bit address space
 Key: ARROW-6112
 URL: https://issues.apache.org/jira/browse/ARROW-6112
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield


The Arrow spec allows a 64-bit address range for buffers (and arrays), so we 
should support this at the API level in Java even if the current Netty backing 
buffers don't support it.





[jira] [Created] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6110:
--

 Summary: [Java] Support LargeList Type and add integration test 
with C++
 Key: ARROW-6110
 URL: https://issues.apache.org/jira/browse/ARROW-6110
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0








[jira] [Created] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6111:
--

 Summary: [Java] Support LargeVarChar and LargeBinary types and add 
integration test with C++
 Key: ARROW-6111
 URL: https://issues.apache.org/jira/browse/ARROW-6111
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0








[jira] [Updated] (ARROW-6109) [Integration] Docker image for integration testing can't be built on windows

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6109:
--
Labels: pull-request-available  (was: )

> [Integration] Docker image for integration testing can't be built on windows
> 
>
> Key: ARROW-6109
> URL: https://issues.apache.org/jira/browse/ARROW-6109
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Git for Windows checks files out with Windows line endings and converts them 
> back before checking them in.
> This causes issues in the Bash scripts (which are copied from the Windows 
> file system into the image) that we use to build the 
> "arrow_integration_xenial_base" image when using Docker on Windows.





[jira] [Created] (ARROW-6109) [Integration] Docker image for integration testing can't be built on windows

2019-08-01 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-6109:
--

 Summary: [Integration] Docker image for integration testing can't 
be built on windows
 Key: ARROW-6109
 URL: https://issues.apache.org/jira/browse/ARROW-6109
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Paddy Horan
Assignee: Paddy Horan
 Fix For: 1.0.0


Git for Windows checks files out with Windows line endings and converts them 
back before checking them in.

This causes issues in the Bash scripts (which are copied from the Windows file 
system into the image) that we use to build the "arrow_integration_xenial_base" 
image when using Docker on Windows.
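
A common mitigation for this class of problem (an assumption, not necessarily 
what the associated pull request does) is to pin LF endings for the affected 
scripts in {{.gitattributes}} so that checkouts on Windows cannot rewrite them:
{code}
# .gitattributes (illustrative entries): force LF endings so shell scripts
# copied into the Docker image stay runnable by bash
*.sh text eol=lf
*.bash text eol=lf
{code}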





[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

2019-08-01 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898482#comment-16898482
 ] 

Wes McKinney commented on ARROW-5480:
-

One slightly higher-level issue is the extent to which we store Arrow schema 
information in the Parquet metadata. I have been thinking that we should 
actually store the whole serialized schema in the Parquet footer as an IPC 
message, so that we can refer to it when reading the file to set various read 
options.
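
For illustration, a minimal pyarrow sketch of the round trip this idea relies 
on (a sketch of the mechanism only, using the recent pyarrow API, not the 
proposed Parquet-footer implementation): a schema, including dictionary types, 
serializes to an IPC message and restores losslessly.
{code:python}
import pyarrow as pa

schema = pa.schema([('x', pa.dictionary(pa.int8(), pa.string()))])

# Serialize the schema as an IPC message and read it back
buf = schema.serialize()
restored = pa.ipc.read_schema(pa.BufferReader(buf))
assert restored.equals(schema)
{code}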

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -
>
> Key: ARROW-5480
> URL: https://issues.apache.org/jira/browse/ARROW-5480
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> Writing a string categorical variable from pandas to parquet means it is read 
> back as string (object dtype). I expected it to be read back as category.
> The same thing happens if the category is numeric -- a numeric category is 
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully 
> translates categories back to pandas. However, when I write to a parquet 
> file, the category type is not preserved.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}





[jira] [Commented] (ARROW-3652) [Python] CategoricalIndex is lost after reading back

2019-08-01 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898481#comment-16898481
 ] 

Wes McKinney commented on ARROW-3652:
-

I'm pretty close to being able to take care of this one finally. There are some 
Parquet-related oddities such as dealing with "large" categories if we want to 
accurately preserve categories in all cases.

> [Python] CategoricalIndex is lost after reading back
> 
>
> Key: ARROW-3652
> URL: https://issues.apache.org/jira/browse/ARROW-3652
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> When a {{CategoricalIndex}} is written and read back the resulting index is 
> not more categorical.
> {code}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
> df['c1'] = df['c1'].astype('category')
> df = df.set_index(['c1'])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> print(df.index)
> # CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, 
> name='c1', dtype='category')
> print(ref_df.index)
> # Index(['a', 'c'], dtype='object', name='c1')
> {code}
> In the metadata the information is correctly contained:
> {code:java}
> {"name": "c1", "field_name": "c1", "p'
> b'andas_type": "categorical", "numpy_type": "int8", "metadata": 
> {"'
> b'num_categories": 2, "ordered": false}
> {code}
>  





[jira] [Resolved] (ARROW-6077) [C++][Parquet] Build logical schema tree mapping Arrow fields to Parquet schema levels

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6077.
-
   Resolution: Fixed
Fix Version/s: 0.15.0 (was: 1.0.0)

Issue resolved by pull request 4971
[https://github.com/apache/arrow/pull/4971]

> [C++][Parquet] Build logical schema tree mapping Arrow fields to Parquet 
> schema levels
> --
>
> Key: ARROW-6077
> URL: https://issues.apache.org/jira/browse/ARROW-6077
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In several places in cpp/src/parquet/arrow, the {{FromParquetSchema}} 
> function is used to construct fields using a filtered "view" of the Parquet 
> schema. This is a hack caused by the lack of some kind of a "schema tree" 
> which maps Parquet concepts to Arrow {{Field}} objects. 
> One manifestation of this issue is that I was unable to implement dictionary 
> encoded subfields in cases like {{list}}, where you want the inner 
> field to be dictionary-encoded. 
> Patch forthcoming





[jira] [Commented] (ARROW-6108) [C++] Appveyor Build_Debug configuration is hanging in C++ unit tests

2019-08-01 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898473#comment-16898473
 ] 

Wes McKinney commented on ARROW-6108:
-

This is really weird. Here's a passing Appveyor build after ARROW-6061:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26404339

The very next patch, ARROW-6068, hangs:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26413929/job/sws48m0603ujwya1

But this patch had a green build on Antoine's PR for ARROW-6068:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26399506

> [C++] Appveyor Build_Debug configuration is hanging in C++ unit tests
> -
>
> Key: ARROW-6108
> URL: https://issues.apache.org/jira/browse/ARROW-6108
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>
> Not sure which patch introduced this, but here is one master build where it 
> occurs
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26413929/job/sws48m0603ujwya1
> The commit before this patch seems to have been OK





[jira] [Assigned] (ARROW-6093) [Java] reduce branches in algo for first match in VectorRangeSearcher

2019-08-01 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan reassigned ARROW-6093:
---

Assignee: Liya Fan

> [Java] reduce branches in algo for first match in VectorRangeSearcher
> -
>
> Key: ARROW-6093
> URL: https://issues.apache.org/jira/browse/ARROW-6093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Liya Fan
>Priority: Major
>
> This is a follow-up Jira for the improvement suggested by [~fsaintjacques] in 
> the PR for 
> [https://github.com/apache/arrow/pull/4925]
>  





[jira] [Created] (ARROW-6108) [C++] Appveyor Build_Debug configuration is hanging in C++ unit tests

2019-08-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6108:
---

 Summary: [C++] Appveyor Build_Debug configuration is hanging in 
C++ unit tests
 Key: ARROW-6108
 URL: https://issues.apache.org/jira/browse/ARROW-6108
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


Not sure which patch introduced this, but here is one master build where it 
occurs

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26413929/job/sws48m0603ujwya1

The commit before this patch seems to have been OK





[jira] [Created] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers

2019-08-01 Thread Nick Poorman (JIRA)
Nick Poorman created ARROW-6107:
---

 Summary: [Go] ipc.Writer Option to skip appending data buffers
 Key: ARROW-6107
 URL: https://issues.apache.org/jira/browse/ARROW-6107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Nick Poorman


For cases where we have a known shared memory region, it would be great if the 
ipc.Writer (and by extension ipc.Reader?) had the ability to write out 
everything but the actual buffers holding the data. That way we can still 
utilize the ipc mechanisms to communicate without having to serialize all the 
underlying data across the wire.

 

This seems like it should be possible since the `RecordBatch` flatbuffers only 
contain the metadata and the underlying data buffers are appended later. We 
just need to skip appending the underlying data buffers.
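
To illustrate the metadata/body split this relies on, here is a hedged sketch using pyarrow (the Go API would be analogous; `pa.ipc.read_message` and the `Message.metadata`/`Message.body` accessors are from current pyarrow and serve only as an illustration):

{code:java}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

# An encapsulated IPC message is a metadata flatbuffer followed by the
# body (the data buffers); the proposed writer option would emit only
# the former for a known shared-memory region.
msg = pa.ipc.read_message(batch.serialize())
print(msg.metadata.size)  # size of the flatbuffer metadata
print(msg.body.size)      # size of the appended data buffers
{code}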

 

[~sbinet] thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6106) Scala lang support

2019-08-01 Thread Boris V.Kuznetsov (JIRA)
Boris V.Kuznetsov created ARROW-6106:


 Summary: Scala lang support
 Key: ARROW-6106
 URL: https://issues.apache.org/jira/browse/ARROW-6106
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Boris V.Kuznetsov


I ported testArrowStream.java to Scala Specs2 and added it to the PR.

Please see more details in my [PR|https://github.com/apache/arrow/pull/4989]

I'm ready to port other tests as well and add an SBT file.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-3325) [Python] Support reading Parquet binary/string columns directly as DictionaryArray

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3325:

Summary: [Python] Support reading Parquet binary/string columns directly as 
DictionaryArray  (was: [Python] Support reading Parquet binary/string columns 
as pandas Categorical)

> [Python] Support reading Parquet binary/string columns directly as 
> DictionaryArray
> --
>
> Key: ARROW-3325
> URL: https://issues.apache.org/jira/browse/ARROW-3325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Requires PARQUET-1324 and probably quite a bit of extra work.
> Properly implementing this will require dictionary normalization across row 
> groups. When reading a new row group, a fast path that compares the current 
> dictionary with the prior dictionary should be used. This also needs to 
> handle the case where a column chunk "fell back" to PLAIN encoding mid-stream.
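
For orientation, a sketch of how this might surface to Python users, assuming the {{read_dictionary}} option this work would expose (hypothetical at the time of writing; the column and file names are illustrative):

{code:java}
import pyarrow.parquet as pq

# Read a binary/string column directly as a DictionaryArray instead of
# decoding it to plain strings.
table = pq.read_table("data.parquet", read_dictionary=["name"])
print(table.schema)  # 'name' should appear as a dictionary type
{code}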



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6096) [C++] Conditionally depend on boost regex library

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6096.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4985
[https://github.com/apache/arrow/pull/4985]

> [C++] Conditionally depend on boost regex library
> -
>
> Key: ARROW-6096
> URL: https://issues.apache.org/jira/browse/ARROW-6096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There appears to be only one place where the boost regex library is used:
> [cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]
> I think this can be replaced by the C++11 regex library.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5414) [C++] Using "Ninja" build system generator overrides default Release build type on Windows

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5414.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4986
[https://github.com/apache/arrow/pull/4986]

> [C++] Using "Ninja" build system generator overrides default Release build 
> type on Windows
> --
>
> Key: ARROW-5414
> URL: https://issues.apache.org/jira/browse/ARROW-5414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ran into this infuriating issue today. See the gist:
> https://gist.github.com/wesm/c3dd87279ec20b2f2d12665fd264bfef
> The cmake invocation that produces this is:
> {code}
> cmake -G "Ninja" ^
>   -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
>   -DARROW_BUILD_TESTS=on ^
>   -DARROW_CXXFLAGS="/WX /MP" ^
>   -DARROW_GANDIVA=on ^
>   -DARROW_PARQUET=on ^
>   -DARROW_PYTHON=on ^
>   ..
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5564) [C++] Add uriparser to conda-forge

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5564:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Add uriparser to conda-forge
> --
>
> Key: ARROW-5564
> URL: https://issues.apache.org/jira/browse/ARROW-5564
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Developer Tools
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> uriparser is one of the holdouts in our toolchain that is having to be built 
> from source. See also ARROW-5370 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5380) [C++] Fix and enable UBSan for unaligned accesses.

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5380:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Fix and enable UBSan for unaligned accesses.
> --
>
> Key: ARROW-5380
> URL: https://issues.apache.org/jira/browse/ARROW-5380
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Currently, unaligned-access checking in UBSan is turned off. We should 
> introduce a method that safely loads unaligned data, use it to fix the 
> existing UBSan errors, and then re-enable the checks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5351) [Rust] Add support for take kernel functions

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5351:

Fix Version/s: (was: 0.14.1)
   0.15.0

> [Rust] Add support for take kernel functions
> 
>
> Key: ARROW-5351
> URL: https://issues.apache.org/jira/browse/ARROW-5351
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Similar to https://issues.apache.org/jira/browse/ARROW-772, a take function 
> would allow us random-access on arrays, which is useful for sorting and 
> (potentially) filtering.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5560) [C++][Plasma] Cannot create Plasma object after OutOfMemory error

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5560:

Summary: [C++][Plasma] Cannot create Plasma object after OutOfMemory error  
(was: Cannot create Plasma object after OutOfMemory error)

> [C++][Plasma] Cannot create Plasma object after OutOfMemory error
> -
>
> Key: ARROW-5560
> URL: https://issues.apache.org/jira/browse/ARROW-5560
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Affects Versions: 0.13.0
>Reporter: Stephanie Wang
>Assignee: Richard Liaw
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> If the client tries to call `CreateObject` and there is not enough memory 
> left in the object store to create it, an `OutOfMemory` error will be 
> returned. However, the plasma store also creates an entry for the object, 
> even though it failed to be created. This means that later on, if the client 
> tries to create the object again, it will receive an error that the object 
> already exists. Also, if the client tries to get the object, it will hang 
> because the entry appears to be unsealed.
> We should fix this by only creating the object entry if the `CreateObject` 
> operation succeeds.
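
A minimal sketch of the failure sequence from Python, assuming a plasma store is already running at the socket path below and using the pyarrow.plasma client API of this era:

{code:java}
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")  # store capacity assumed small
oid = plasma.ObjectID(20 * b"a")

try:
    client.create(oid, 10**12)  # larger than the store -> OutOfMemory
except Exception as exc:
    print("first create failed:", exc)

# A retry with a size that fits should succeed, but instead fails with
# "object already exists" because the failed create left an unsealed entry.
client.create(oid, 10)
{code}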



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5784) [Release][GLib] Replace c_glib/ after running c_glib/autogen.sh in dev/release/02-source.sh

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5784:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Release][GLib] Replace c_glib/ after running c_glib/autogen.sh in 
> dev/release/02-source.sh
> ---
>
> Key: ARROW-5784
> URL: https://issues.apache.org/jira/browse/ARROW-5784
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The c_glib/ source archive is generated by `make dist` because it includes 
> the configure script.
> The current `dev/release/02-source.sh` builds Arrow C++ and Arrow GLib to 
> include the GTK-Doc artifacts and then runs `make dist`, but that is slow. 
> So run only `c_glib/autogen.sh` and then replace c_glib/.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5790) [Python] Passing zero-dim numpy array to pa.array causes segfault

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5790:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] Passing zero-dim numpy array to pa.array causes segfault
> -
>
> Key: ARROW-5790
> URL: https://issues.apache.org/jira/browse/ARROW-5790
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: OSX, py37
>Reporter: Brock Mendel
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code:java}
> import pyarrow as pa
> import numpy as np
> zerod = np.array(0)
> result = pa.array(zerod)  # <-- segfault
> {code}
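
Until the fix lands, a hedged workaround sketch is to promote the zero-dimensional array to one dimension (or extract the Python scalar) before handing it to pyarrow:

{code:java}
import numpy as np
import pyarrow as pa

zerod = np.array(0)

# Promote to a 1-d array before calling pa.array.
result = pa.array(np.atleast_1d(zerod))
print(result)  # [0]

# Or extract the scalar and wrap it in a list.
result = pa.array([zerod.item()])
{code}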



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5775) [C++] StructArray : cached boxed fields not thread-safe

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5775:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] StructArray : cached boxed fields not thread-safe
> ---
>
> Key: ARROW-5775
> URL: https://issues.apache.org/jira/browse/ARROW-5775
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The lazy initialization isn't thread-safe (it relies neither on a lock nor on 
> an atomic).
> Perhaps we need a more general "cached property" facility to handle these 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5850) [CI][R] R appveyor job is broken after release

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5850:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [CI][R] R appveyor job is broken after release
> --
>
> Key: ARROW-5850
> URL: https://issues.apache.org/jira/browse/ARROW-5850
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I forgot to add ci/PKGBUILD to the release script that bumps all of the 
> versions, so the pkgver it has is out of sync with what is in r/DESCRIPTION 
> after the release. 
> To fix, either update the pkgver in ci/PKGBUILD and add code to the release 
> script and tests to bump its version, or remove the pkgver field entirely and 
> just read it from r/DESCRIPTION in the pkgver() function, if makepkg allows 
> that. Unclear if one solution is clearly better than the other.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5836) [Java][OSX] Flight tests are failing: address already in use

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5836:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Java][OSX] Flight tests are failing: address already in use
> 
>
> Key: ARROW-5836
> URL: https://issues.apache.org/jira/browse/ARROW-5836
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 0.14.0
>Reporter: Krisztian Szucs
>Assignee: lidavidm
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code}
> Jul 03, 2019 3:09:45 PM io.grpc.netty.NettyServerHandler onStreamError
> WARNING: Stream Error
> io.netty.handler.codec.http2.Http2Exception$StreamException: Received DATA 
> frame for an unknown stream 3
> at 
> io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:129)
> at 
> io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.shouldIgnoreHeadersOrDataFrame(DefaultHttp2ConnectionDecoder.java:531)
> at 
> io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onDataRead(DefaultHttp2ConnectionDecoder.java:183)
> at 
> io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onDataRead(Http2InboundFrameLogger.java:48)
> at 
> io.netty.handler.codec.http2.DefaultHttp2FrameReader.readDataFrame(DefaultHttp2FrameReader.java:421)
> at 
> io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:251)
> at 
> io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:160)
> at 
> io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
> at 
> io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:118)
> at 
> io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:390)
> at 
> io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:450)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> Jul 03, 2019 3:09:46 PM io.grpc.netty.NettyServerHandler onStreamError
> WARNING: Stream Error
> io.netty.handler.codec.http2.Http2Exception$StreamException: Received DATA 
> frame for an unknown stream 3
> at 
> io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:129)
> at 
> io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.shouldIgnoreHeadersOrDataFrame(DefaultHttp2ConnectionDecoder.java:531)
> at 
> io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onDataRead(DefaultHttp2ConnectionDecoder.java:183)
> at 
> 

[jira] [Updated] (ARROW-5838) [C++][Flight][OSX] Building 3rdparty grpc cannot find OpenSSL

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5838:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++][Flight][OSX] Building 3rdparty grpc cannot find OpenSSL
> -
>
> Key: ARROW-5838
> URL: https://issues.apache.org/jira/browse/ARROW-5838
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 0.14.0
>Reporter: Krisztian Szucs
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Without system grpc installed compiling grpc cannot find openssl:
> {code}
> Could NOT find OpenSSL, try to set the path to OpenSSL root folder in the
>   system variable OPENSSL_ROOT_DIR (missing: OPENSSL_INCLUDE_DIR)
> Call Stack (most recent call first):
>   
> /usr/local/Cellar/cmake/3.14.5/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378
>  (_FPHSA_FAILURE_MESSAGE)
>   /usr/local/Cellar/cmake/3.14.5/share/cmake/Modules/FindOpenSSL.cmake:413 
> (find_package_handle_standard_args)
>   cmake/ssl.cmake:45 (find_package)
>   CMakeLists.txt:141 (include)
> {code}
> Passing {{ARROW_CMAKE_OPTION="-DOPENSSL_ROOT_DIR=$(brew --prefix openssl)"}} 
> doesn't seem to get forwarded to build_grpc
> {code}
> '/usr/local/Cellar/cmake/3.14.5/bin/cmake' '-DCMAKE_BUILD_TYPE=RELEASE' 
> '-DCMAKE_PREFIX_PATH=';/usr/local;/usr/local;/var/folders/cz/jrwncy5s5cb612sgwscd0z8hgn/T/arrow-0.14.0.X.I2yQzNGd/apache-arrow-0.14.0/cpp/thirdparty/cares_
> ep-install;'' '-DgRPC_CARES_PROVIDER=package' 
> '-DgRPC_GFLAGS_PROVIDER=package' '-DgRPC_PROTOBUF_PROVIDER=package' 
> '-DgRPC_SSL_PROVIDER=package' '-DgRPC_ZLIB_PROVIDER=package' 
> '-DCMAKE_CXX_FLAGS= -Qunused-arguments -fcolor-diagnostics -O3
> -DNDEBUG -O3 -DNDEBUG -fPIC' '-DCMAKE_C_FLAGS= -Qunused-arguments -O3 
> -DNDEBUG -O3 -DNDEBUG -fPIC' 
> '-DCMAKE_INSTALL_PREFIX=/var/folders/cz/jrwncy5s5cb612sgwscd0z8hgn/T/arrow-0.14.0.X.I2yQzNGd/apache-arrow-0.14.0/cpp/thirdparty/grp
> c_ep-install' '-DCMAKE_INSTALL_LIBDIR=lib' 
> '-DProtobuf_PROTOC_LIBRARY=/usr/local/lib/libprotoc.dylib' 
> '-DBUILD_SHARED_LIBS=OFF' '-GUnix Makefiles' 
> '/var/folders/cz/jrwncy5s5cb612sgwscd0z8hgn/T/arrow-0.14.0.X.I2yQzNGd/apache-arrow-
> 0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep'
> {code}
> Installing grpc with brew helps though.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5851) [C++] Compilation of reference benchmarks fails

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5851:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Compilation of reference benchmarks fails
> ---
>
> Key: ARROW-5851
> URL: https://issues.apache.org/jira/browse/ARROW-5851
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code}
> ../src/arrow/util/compression-benchmark.cc: In function 'void 
> arrow::util::StreamingDecompression(arrow::Compression::type, const 
> std::vector<uint8_t>&, benchmark::State&)':
> ../src/arrow/util/compression-benchmark.cc:172:5: error: 'ARROW_CHECK' was 
> not declared in this scope
>  ARROW_CHECK(decompressed_size == static_cast<int64_t>(data.size()));
>  ^~~
> ../src/arrow/util/compression-benchmark.cc:172:5: note: suggested 
> alternative: 'ARROW_CONCAT'
>  ARROW_CHECK(decompressed_size == static_cast<int64_t>(data.size()));
>  ^~~
>  ARROW_CONCAT
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5849) [C++] Compiler warnings on mingw-w64

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5849:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Compiler warnings on mingw-w64
> 
>
> Key: ARROW-5849
> URL: https://issues.apache.org/jira/browse/ARROW-5849
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Jeroen
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In mingw64 we see the following warnings:
> {code}
> [ 54%] Building CXX object 
> src/arrow/CMakeFiles/arrow_static.dir/util/io-util.cc.obj
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/decimal.cc:
>  In static member function 'static arrow::Status 
> arrow::Decimal128::FromString(const string_view&, arrow::Decimal128*, 
> int32_t*, int32_t*)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/decimal.cc:313:35:
>  warning: 'dec.arrow::{anonymous}::DecimalComponents::exponent' may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
>*scale = -adjusted_exponent + len - 1;
> ~~~^~~~
> {code} 
> {code}
> [ 56%] Building CXX object 
> src/arrow/CMakeFiles/arrow_static.dir/util/string_builder.cc.obj
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:
>  In static member function 'static arrow::Status 
> arrow::internal::TemporaryDir::Make(const string&, 
> std::unique_ptr<arrow::internal::TemporaryDir>*)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:897:3:
>  warning: 'created' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
>if (!created) {
>^~
> {code}
> And on mingw32 we also see these:
> {code}
> In file included from 
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/file.cc:25:
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:
>  In function 'void* mmap(void*, size_t, int, int, int, off_t)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:94:62:
>  warning: right shift count >= width of type [-Wshift-count-overflow]
>const DWORD dwMaxSizeHigh = static_cast<DWORD>((maxSize >> 32) & 
> 0xFFFFFFFFL);
>   ^~
> {code}
> {code}
> [ 54%] Building CXX object 
> src/arrow/CMakeFiles/arrow_static.dir/util/logging.cc.obj
> In file included from 
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:63:
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:
>  In function 'void* mmap(void*, size_t, int, int, int, off_t)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/io/mman.h:94:62:
>  warning: right shift count >= width of type [-Wshift-count-overflow]
>const DWORD dwMaxSizeHigh = static_cast<DWORD>((maxSize >> 32) & 
> 0xFFFFFFFFL);
>   ^~
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:
>  In function 'arrow::Status arrow::internal::MemoryMapRemap(void*, size_t, 
> size_t, int, void**)':
> C:/msys64/home/mingw-packages/mingw-w64-arrow/src/arrow/cpp/src/arrow/util/io-util.cc:568:55:
>  warning: right shift count >= width of type [-Wshift-count-overflow]
>LONG new_size_high = static_cast<LONG>((new_size >> 32) & 0xFFFFFFFFL);
>  
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5827) [C++] Require c-ares CMake config

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5827:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Require c-ares CMake config
> -
>
> Key: ARROW-5827
> URL: https://issues.apache.org/jira/browse/ARROW-5827
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Because gRPC requires c-ares' CMake config.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5828) [C++] Add Protocol Buffers version check

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5828:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Add Protocol Buffers version check
> 
>
> Key: ARROW-5828
> URL: https://issues.apache.org/jira/browse/ARROW-5828
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> If we use old Protocol Buffers, bundled gRPC reports build error:
> https://lists.apache.org/thread.html/10f8c4d2372638c57c3a956180b2fa3bbd036a27d79eb2eb7b9ffe76@%3Cdev.arrow.apache.org%3E
> {noformat}
>   
> /tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/php_generator.cc:21:10:
> fatal error: google/protobuf/compiler/php/php_generator.h: No such
> file or directory
>#include <google/protobuf/compiler/php/php_generator.h>
> ^~
>   compilation terminated.
>   make[5]: *** 
> [CMakeFiles/grpc_plugin_support.dir/src/compiler/php_generator.cc.o]
> Error 1
>   make[5]: *** Waiting for unfinished jobs
>   
> /tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/ruby_generator.cc:
> In function ‘grpc::string grpc_ruby_generator::GetServices(const
> FileDescriptor*)’:
>   
> /tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/ruby_generator.cc:165:25:
> error: ‘const class google::protobuf::FileOptions’ has no member named
> ‘has_ruby_package’; did you mean ‘has_java_package’?
>if (file->options().has_ruby_package()) {
>^~~~
>has_java_package
>   
> /tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/ruby_generator.cc:166:38:
> error: ‘const class google::protobuf::FileOptions’ has no member named
> ‘ruby_package’; did you mean ‘java_package’?
>  package_name = file->options().ruby_package();
> ^~~~
> java_package
>   make[5]: *** 
> [CMakeFiles/grpc_plugin_support.dir/src/compiler/ruby_generator.cc.o]
> Error 1
>   make[4]: *** [CMakeFiles/grpc_plugin_support.dir/all] Error 2
>   make[4]: *** Waiting for unfinished jobs
>   make[3]: *** [all] Error 2
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5863) [Python] Segmentation Fault via pytest-runner

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5863:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] Segmentation Fault via pytest-runner
> -
>
> Key: ARROW-5863
> URL: https://issues.apache.org/jira/browse/ARROW-5863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: $ uname -a
> Linux aleph 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 
> x86_64 GNU/Linux
> $ python --version
> Python 3.7.3
> $ pip freeze | grep -P "(pyarrow|pytest)"
> pyarrow==0.14.0
> pytest==5.0.0
> pytest-benchmark==3.2.2
> pytest-cov==2.7.1
> pytest-env==0.6.2
> pytest-forked==1.0.2
> pytest-html==1.21.1
> pytest-metadata==1.8.0
> pytest-mock==1.10.4
> pytest-runner==5.1
> pytest-sugar==0.9.2
> pytest-xdist==1.29.0
>Reporter: Josh Bode
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
> Attachments: pyarrow-issue.tar.bz2, pytest-runner.log, pytest.log
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When running {{pytest}} on projects using {{pyarrow==0.14.0}} on Linux, I am 
> getting segmentation faults, but interestingly _only_ when run via 
> {{pytest-runner}} (which provides the {{setup.py pytest}} command)
> This works (i.e. {{pytest}} directly):
> {code:java}
> $ pytest
> Test session starts (platform: linux, Python 3.7.3, pytest 5.0.0, 
> pytest-sugar 0.9.2)
> benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False 
> min_rounds=5 min_time=0.05 max_time=1.0 calibration_precision=10 
> warmup=False warmup_iterations=10)
> rootdir: /home/josh/scratch/pyarrow-issue
> plugins: sugar-0.9.2, Flask-Dance-2.2.0, env-0.6.2, mock-1.10.4, 
> xdist-1.29.0, requests-mock-1.6.0, forked-1.0.2, dash-1.0.0, cov-2.7.1, 
> html-1.21.1, benchmark-3.2.2, metadata-1.8.0
> collecting ...
> tests/test_pyarrow.py ✓ 100% ██
> Results (0.09s):
> 1 passed{code}
> However, this does not work, ending in a segmentation fault, even though the 
> tests pass:
> {code:java}
> $ python setup.py pytest
> running pytest
> running egg_info
> writing pyarrow_issue.egg-info/PKG-INFO
> writing dependency_links to pyarrow_issue.egg-info/dependency_links.txt
> writing requirements to pyarrow_issue.egg-info/requires.txt
> writing top-level names to pyarrow_issue.egg-info/top_level.txt
> reading manifest file 'pyarrow_issue.egg-info/SOURCES.txt'
> writing manifest file 'pyarrow_issue.egg-info/SOURCES.txt'
> running build_ext
> Test session starts (platform: linux, Python 3.7.3, pytest 5.0.0, 
> pytest-sugar 0.9.2)
> benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False 
> min_rounds=5 min_time=0.05 max_time=1.0 calibration_precision=10 
> warmup=False warmup_iterations=10)
> rootdir: /home/josh/scratch/pyarrow-issue
> plugins: sugar-0.9.2, Flask-Dance-2.2.0, env-0.6.2, mock-1.10.4, 
> xdist-1.29.0, requests-mock-1.6.0, forked-1.0.2, dash-1.0.0, cov-2.7.1, 
> html-1.21.1, benchmark-3.2.2, metadata-1.8.0
> collecting ...
> tests/test_pyarrow.py ✓ 100% ██
> Results (0.07s):
> 1 passed
> zsh: segmentation fault (core dumped) python setup.py pytest{code}
> backtrace from {{gdb}}
> {code:java}
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> 0x77c10b58 in ?? () from /usr/lib/libpython3.7m.so.1.0
> (gdb) bt
> #0 0x77c10b58 in ?? () from /usr/lib/libpython3.7m.so.1.0
> #1 0x77ae46cc in ?? () from /usr/lib/libpython3.7m.so.1.0
> #2 0x7023a6b3 in arrow::py::PyExtensionType::~PyExtensionType() ()
> from 
> /home/josh/.virtualenvs/default/lib/python3.7/site-packages/pyarrow/./libarrow_python.so.14
> #3 0x7fffed5e6467 in std::unordered_map std::shared_ptr, std::hash, 
> std::equal_to, std::allocator std::shared_ptr > > >::~unordered_map() ()
> from 
> /home/josh/.virtualenvs/default/lib/python3.7/site-packages/pyarrow/./libarrow.so.14
> #4 0x77de5e70 in __run_exit_handlers () from /usr/lib/libc.so.6
> #5 0x77de5fae in exit () from /usr/lib/libc.so.6
> #6 0x77dcfeea in __libc_start_main () from /usr/lib/libc.so.6
> #7 0x505e in _start ()
> {code}
> I have observed this behaviour on my machine running natively, and also via 
> docker. Also, 0.13.0 does not exhibit this behaviour
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5873) [Python] Segmentation fault when comparing schema with None

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5873:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] Segmentation fault when comparing schema with None
> ---
>
> Key: ARROW-5873
> URL: https://issues.apache.org/jira/browse/ARROW-5873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Florian Jetter
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When comparing a schema with a Python {{None}}, I get a segmentation fault.
> This is a regression relative to 0.13.0:
> {code:java}
> In [2]: import pyarrow as pa
> In [3]: pa.schema([pa.field("something", pa.int64())]).equals(None)
> [1]82085 segmentation fault  ipython
> {code}
> System information:
> System Version: macOS 10.13.6 (17G6030)
> Kernel Version: Darwin 17.7.0
> Python 3.6.7
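
Until the fix is released, a defensive sketch (the helper name is illustrative) avoids passing {{None}} into {{Schema.equals}}:

{code:java}
import pyarrow as pa

def schema_equals(schema, other):
    # Guard against None before delegating to pyarrow's comparison.
    return other is not None and schema.equals(other)

schema = pa.schema([pa.field("something", pa.int64())])
print(schema_equals(schema, None))    # False, without the segfault
print(schema_equals(schema, schema))  # True
{code}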



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5866) [C++] Remove duplicate library in cpp/Brewfile

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5866:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Remove duplicate library in cpp/Brewfile
> --
>
> Key: ARROW-5866
> URL: https://issues.apache.org/jira/browse/ARROW-5866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5856) [Python] linking 3rd party cython modules against pyarrow fails since 0.14.0

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5856:

Fix Version/s: 0.15.0

> [Python] linking 3rd party cython modules against pyarrow fails since 0.14.0
> 
>
> Key: ARROW-5856
> URL: https://issues.apache.org/jira/browse/ARROW-5856
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Steve Stagg
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
> Attachments: setup.py, test.pyx
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> Compiling Cython modules that link against the pyarrow library, using the 
> recommended approach for obtaining the appropriate include and link flags, 
> has stopped working in pyarrow 0.14.0.
>  
> A minimal test case that demonstrates the problem is included in the 
> attachments.
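
For reference, a minimal sketch of the recommended approach (pa.get_include, pa.get_libraries, and pa.get_library_dirs are pyarrow's documented helpers; the module and file names are illustrative):

{code:java}
# setup.py -- minimal Cython extension linking against pyarrow
import numpy as np
import pyarrow as pa
from Cython.Build import cythonize
from setuptools import Extension, setup

ext = Extension(
    "test",
    sources=["test.pyx"],
    include_dirs=[np.get_include(), pa.get_include()],
    libraries=pa.get_libraries(),
    library_dirs=pa.get_library_dirs(),
)

setup(ext_modules=cythonize([ext]))
{code}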



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5868) [Python] manylinux2010 wheels have shared library dependency on liblz4

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5868:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] manylinux2010 wheels have shared library dependency on liblz4
> --
>
> Key: ARROW-5868
> URL: https://issues.apache.org/jira/browse/ARROW-5868
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Haowei Yu
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> I am using pyarrow in my project. It works well with version 0.13.0.
> However, after upgrading to the recently released 0.14.0, I got this error:
> AttributeError: module 'pyarrow' has no attribute 'compat'
> Stacktrace:
>  2019-07-06 09:08:21 Traceback (most recent call last):
>  2019-07-06 09:08:21 File 
> "/home/jenkins/workspace/CLIENTS_PERF/Tests/ClientsPerf/PythonConnectorPerf/src/PerfTestRunner.py",
>  line 12, in 
>  2019-07-06 09:08:21 import snowflake.connector
>  2019-07-06 09:08:21 File 
> "/home/jenkins/workspace/CLIENTS_PERF/Tests/ClientsPerf/PythonConnectorPerf/pythonconnector-perf/lib/python3.5/site-packages/snowflake/connector/__init__.py",
>  line 21, in 
>  2019-07-06 09:08:21 from .connection import SnowflakeConnection
>  2019-07-06 09:08:21 File 
> "/home/jenkins/workspace/CLIENTS_PERF/Tests/ClientsPerf/PythonConnectorPerf/pythonconnector-perf/lib/python3.5/site-packages/snowflake/connector/connection.py",
>  line 42, in 
>  2019-07-06 09:08:21 from .cursor import SnowflakeCursor, LOG_MAX_QUERY_LENGTH
>  2019-07-06 09:08:21 File 
> "/home/jenkins/workspace/CLIENTS_PERF/Tests/ClientsPerf/PythonConnectorPerf/pythonconnector-perf/lib/python3.5/site-packages/snowflake/connector/cursor.py",
>  line 35, in 
>  2019-07-06 09:08:21 from pyarrow.ipc import open_stream
>  2019-07-06 09:08:21 File 
> "/home/jenkins/workspace/CLIENTS_PERF/Tests/ClientsPerf/PythonConnectorPerf/pythonconnector-perf/lib/python3.5/site-packages/pyarrow/__init__.py",
>  line 47, in 
>  2019-07-06 09:08:21 import pyarrow.compat as compat
> I can provide more detail if requested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5874) [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under /usr/local/opt

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5874:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under 
> /usr/local/opt
> ---
>
> Key: ARROW-5874
> URL: https://issues.apache.org/jira/browse/ARROW-5874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: macOS 10.14.5
> Anaconda Python 3.7.3
>Reporter: Michael Anselmi
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available, pyarrow, wheel
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hello, and congrats on the recent release of Apache Arrow 0.14.0!
> This morning I installed pyarrow 0.14.0 on my macOS 10.14.5 system like so:
> {code:java}
> python3.7 -m venv ~/virtualenv/pyarrow-0.14.0
> source ~/virtualenv/pyarrow-0.14.0/bin/activate
> pip install --upgrade pip setuptools
> pip install pyarrow  # installs 
> pyarrow-0.14.0-cp37-cp37m-macosx_10_6_intel.whl
> pip freeze --all
> # numpy==1.16.4
> # pip==19.1.1
> # pyarrow==0.14.0
> # setuptools==41.0.1
> # six==1.12.0
> {code}
> However I am unable to import pyarrow:
> {code:java}
> python -c 'import pyarrow'
> # Traceback (most recent call last):
> #   File "", line 1, in 
> #   File 
> "/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in 
> # from pyarrow.lib import cpu_count, set_cpu_count
> # ImportError: 
> dlopen(/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so,
>  2): Library not loaded: /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #   Referenced from: 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> #   Reason: image not found
> {code}
> pyarrow is trying to load a shared library (OpenSSL in this case) from a path 
> under {{/usr/local/opt}} that doesn't exist; perhaps that OpenSSL had been 
> provided by Homebrew as part of your build process?  Unfortunately this makes 
> the pyarrow 0.14.0 wheel completely unusable on my system or any system that 
> doesn't have OpenSSL installed in that location.  This is a regression from 
> pyarrow 0.13.0 as those wheels "just worked".
> Additional diagnostic output below.  I ran {{otool -L}} on each {{.dylib}} 
> and {{.so}} file in 
> {{/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow}}
>  and included the output for those with dependencies under {{/usr/local/opt}}:
> {code:java}
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib:
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 
> 1.2.8)
> # @rpath/libarrow_boost_system.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # @rpath/libarrow_boost_filesystem.dylib (compatibility version 
> 0.0.0, current version 0.0.0)
> # @rpath/libarrow_boost_regex.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib:
> # @rpath/libarrow_flight.14.dylib (compatibility version 14.0.0, 
> current version 14.0.0)
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> 

[jira] [Updated] (ARROW-5889) [Python][C++] Parquet backwards compat for timestamps without timezone broken

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5889:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python][C++] Parquet backwards compat for timestamps without timezone broken
> -
>
> Key: ARROW-5889
> URL: https://issues.apache.org/jira/browse/ARROW-5889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Florian Jetter
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: parquet, pull-request-available
> Fix For: 0.14.1, 0.15.0
>
> Attachments: 0.12.1.parquet, 0.13.0.parquet
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When reading a Parquet file that has timestamp fields, they are read as 
> timestamps with timezone UTC if the file was written by pyarrow 0.13.0 
> or 0.12.1.
> Expected behavior would be that they are loaded as timestamps without any 
> timezone information.
> The attached files contain one row for all basic types and a few nested 
> types; the timestamp fields are called datetime64 and datetime64_tz.
> see also 
> [https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat]
> [https://github.com/JDASoftwareGroup/kartothek/blob/c47e52116e2dc726a74d7d6b97922a0252722ed0/tests/serialization/test_arrow_compat.py#L31]
>  
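
Until this is fixed, a hedged workaround sketch on the reading side is to strip the spurious UTC timezone after converting to pandas (column name taken from the attached files):

{code:java}
import pyarrow.parquet as pq

table = pq.read_table("0.13.0.parquet")
df = table.to_pandas()

# Drop the spurious UTC timezone to recover naive timestamps.
df["datetime64"] = df["datetime64"].dt.tz_convert(None)
{code}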



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5904) [Java] [Plasma] Fix compilation of Plasma Java client

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5904:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Java] [Plasma] Fix compilation of Plasma Java client
> -
>
> Key: ARROW-5904
> URL: https://issues.apache.org/jira/browse/ARROW-5904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is broken since the introduction of user-defined Status messages:
> {code:java}
> external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:
>  In function '_jobject* 
> Java_org_apache_arrow_plasma_PlasmaClientJNI_create(JNIEnv*, jclass, jlong, 
> jbyteArray, jint, jbyteArray)':
> external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:114:9:
>  error: 'class arrow::Status' has no member named 'IsPlasmaObjectExists'
>if (s.IsPlasmaObjectExists()) {
>  ^
> external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:120:9:
>  error: 'class arrow::Status' has no member named 'IsPlasmaStoreFull'
>if (s.IsPlasmaStoreFull()) {
>  ^{code}
> [~guoyuhong85] Can you add this codepath to the test so we can catch this 
> kind of breakage more quickly in the future?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5887) [C#] ArrowStreamWriter writes FieldNodes in wrong order

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5887:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C#] ArrowStreamWriter writes FieldNodes in wrong order
> ---
>
> Key: ARROW-5887
> URL: https://issues.apache.org/jira/browse/ARROW-5887
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Eric Erhardt
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>   Original Estimate: 4h
>  Time Spent: 0.5h
>  Remaining Estimate: 3.5h
>
> When ArrowStreamWriter is writing a {{RecordBatch}} with {{null}}s in it, it 
> is mixing up the column's {{NullCount}}.
> You can see here:
> [https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L195-L200]
> It is writing the fields from {{0}} -> {{fieldCount}} order. But then 
> [lower|https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L216-L220],
>  it is writing the fields from {{fieldCount}} -> {{0}}.
> Looking at the [Java 
> implementation|https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/FBSerializables.java#L36-L44]
>  it says
> {quote}// struct vectors have to be created in reverse order
> {quote}
>  
> A simple test of roundtripping the following RecordBatch shows the issue:
>  
> {code:java}
> var result = new RecordBatch(
> new Schema.Builder()
> .Field(f => f.Name("age").DataType(Int32Type.Default))
> .Field(f => f.Name("CharCount").DataType(Int32Type.Default))
> .Build(),
> new IArrowArray[]
> {
> new Int32Array(
> new ArrowBuffer.Builder().Append(0).Build(),
> new ArrowBuffer.Builder().Append(0).Build(),
> length: 1,
> nullCount: 1,
> offset: 0),
> new Int32Array(
> new ArrowBuffer.Builder().Append(7).Build(),
> ArrowBuffer.Empty,
> length: 1,
> nullCount: 0,
> offset: 0)
> },
> length: 1);
> {code}
> Here, the "age" column should have a `null` in it. However, when you write 
> and read this RecordBatch back, you see that the "CharCount" column has 
> `NullCount` == 1 and "age" column has `NullCount` == 0.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5878) [Python][C++] Parquet reader not forward compatible for timestamps without timezone

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5878:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python][C++] Parquet reader not forward compatible for timestamps without 
> timezone
> ---
>
> Key: ARROW-5878
> URL: https://issues.apache.org/jira/browse/ARROW-5878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Florian Jetter
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.1, 0.15.0
>
> Attachments: timezones_pyarrow_14.paquet
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Timestamps without timezone that are written by pyarrow 0.14.0 can no longer 
> be read as timestamps by earlier versions; the timestamp is read as an 
> integer by pyarrow 0.13.0.
> Looking at the parquet schemas, it seems that the logical type cannot be 
> understood by the older versions; see below.
> h4. File generation with pyarrow 0.14.0
> {code:java}
> import datetime
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame(
> {
> "datetime64": pd.Series(["2018-01-01"], dtype="datetime64[ns]"),
> "datetime64_ts": pd.Series(
> [pd.Timestamp(datetime.datetime(2018, 1, 1), tz="Europe/Berlin")],
> dtype="datetime64[ns]",
> ),
> }
> )
> pq.write_table(pa.Table.from_pandas(df), "timezones_pyarrow_14.paquet")
> {code}
> h4. Reading with pyarrow 0.13.0
> {code:java}
> In [1]: import pyarrow.parquet as pq
> In [2]: import pyarrow as pa
> In [3]: with open("timezones_pyarrow_14.paquet", "rb") as fd:
>...: table = pq.read_pandas(fd)
>...:
> In [4]: table.to_pandas()
> Out[4]:
>  datetime64 datetime64_ts
> 0  15147648 2018-01-01 00:00:00+01:00
> In [5]: table.to_pandas().dtypes
> Out[5]:
> datetime64   int64
> datetime64_tsdatetime64[ns, Europe/Berlin]
> dtype: object
> {code}
> h3. Parquet schema as seen by pyarrow versions:
> pyarrow 0.13.0 parquet schema
> {code:java}
> datetime64: INT64
> datetime64_ts: INT64 TIMESTAMP_MICROS
> {code}
> pyarrow 0.14.0 parquet schema
> {code:java}
> datetime64: INT64 Timestamp(isAdjustedToUTC=false, timeUnit=microseconds)
> datetime64_ts: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
> {code}
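
One possible mitigation on the writer side, assuming the deprecated INT96 path avoids the new logical-type annotation entirely (an untested sketch, reusing the df from the snippet above):

{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# INT96 timestamps carry no Timestamp(isAdjustedToUTC=...) annotation,
# so pyarrow 0.13.0 should still read them back as timestamps.
pq.write_table(
    pa.Table.from_pandas(df),
    "timezones_compat.parquet",
    use_deprecated_int96_timestamps=True,
)
{code}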



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5899) [Python][Packaging] Bundle uriparser.dll in windows wheels

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5899:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python][Packaging] Bundle uriparser.dll in windows wheels 
> ---
>
> Key: ARROW-5899
> URL: https://issues.apache.org/jira/browse/ARROW-5899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The windows nightly wheel builds are failing: 
> https://ci.appveyor.com/project/Ursa-Labs/crossbow/builds/25688922, probably 
> caused by 88fcb09, but it's hard to tell because the error message 
> "ImportError: DLL load failed: The specified module could not be found." is 
> not very descriptive.
> Theoretically it shouldn't affect the 0.14 release because 88fcb09 was added 
> afterwards.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5908) [C#] ArrowStreamWriter doesn't align buffers to 8 bytes

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5908:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C#] ArrowStreamWriter doesn't align buffers to 8 bytes
> ---
>
> Key: ARROW-5908
> URL: https://issues.apache.org/jira/browse/ARROW-5908
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Eric Erhardt
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When writing RecordBatches using ArrowStreamWriter, if the ArrowBuffers being 
> written aren't all 8 byte aligned, the serialized RecordBatch won't conform 
> to the Arrow specification. This leads to other languages' readers to throw 
> an error when reading Arrow streams written by the C# writer.
> For example, if reading the stream from Python or C++, an error is raised 
> here: 
> [https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/ipc/reader.cc#L107-L110]
> A similar error is raised when Java tries to read the stream.
> We should be ensuring that the buffers being written to the stream are padded 
> to 8 bytes, no matter their length, as specified in 
> [https://arrow.apache.org/docs/format/Layout.html#requirements-goals-and-non-goals]
>  
> {quote} * It is required to have all the contiguous memory buffers in an IPC 
> payload aligned at 8-byte boundaries. In other words, each buffer must start 
> at an aligned 8-byte offset. Additionally, each buffer should be padded to a 
> multiple of 8 bytes.{quote}
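
For reference, a small sketch of the padding arithmetic the specification requires (Python only to illustrate; the actual fix belongs in the C# writer):

{code:java}
def padded_length(length, alignment=8):
    """Round length up to the next multiple of alignment."""
    return (length + alignment - 1) & ~(alignment - 1)

assert padded_length(0) == 0
assert padded_length(1) == 8
assert padded_length(8) == 8
assert padded_length(13) == 16
{code}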



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5939) [Release] Add support for generating vote email template separately

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5939:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Release] Add support for generating vote email template separately
> ---
>
> Key: ARROW-5939
> URL: https://issues.apache.org/jira/browse/ARROW-5939
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5938) [Release] Create branch for adding release note automatically

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5938:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Release] Create branch for adding release note automatically
> -
>
> Key: ARROW-5938
> URL: https://issues.apache.org/jira/browse/ARROW-5938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5921:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++][Fuzzing] Missing nullptr checks in IPC
> 
>
> Key: ARROW-5921
> URL: https://issues.apache.org/jira/browse/ARROW-5921
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>  Labels: fuzzer, pull-request-available
> Fix For: 0.14.1, 0.15.0
>
> Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, 
> crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, 
> crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, 
> crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, 
> crash-fd237566879dc60fff4d956d5fe3533d74a367f3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {{arrow-ipc-fuzzing-test}} found the attached crashes. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-xxx
> {code}
> The attached crashes all have distinct sources, and all relate to 
> missing nullptr checks. I have a fix basically ready.
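>
> A minimal, hypothetical illustration of the class of bug (these names are 
> invented, not the actual Arrow IPC code): a pointer decoded from untrusted 
> input may be null and must be checked before it is dereferenced.
> {code:cpp}
> #include <iostream>
> #include <string>
> 
> struct RecordBatchHeader { long num_rows; };
> 
> // Stand-in for a flatbuffer accessor that returns nullptr when the fuzzed
> // input does not actually contain a RecordBatch header.
> const RecordBatchHeader* DecodeHeader(bool well_formed) {
>   static RecordBatchHeader header{3};
>   return well_formed ? &header : nullptr;
> }
> 
> std::string ReadBatch(bool well_formed) {
>   const RecordBatchHeader* header = DecodeHeader(well_formed);
>   if (header == nullptr) {  // the kind of check that was missing
>     return "invalid message: header is not a RecordBatch";
>   }
>   return "rows: " + std::to_string(header->num_rows);
> }
> 
> int main() {
>   std::cout << ReadBatch(true) << std::endl;
>   std::cout << ReadBatch(false) << std::endl;
>   return 0;
> }
> {code}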





[jira] [Updated] (ARROW-5934) [Python] Bundle arrow's LICENSE with the wheels

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5934:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] Bundle arrow's LICENSE with the wheels
> ---
>
> Key: ARROW-5934
> URL: https://issues.apache.org/jira/browse/ARROW-5934
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Guide to bundle LICENSE files with the wheels: 
> https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file
> We also need to ensure that all third-party dependencies' licenses are 
> attached to it, especially because we're statically linking multiple 
> third-party dependencies; for example, uriparser is missing from the LICENSE 
> file.
> cc [~wesmckinn]





[jira] [Updated] (ARROW-5937) [Release] Stop parallel binary upload

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5937:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Release] Stop parallel binary upload
> -
>
> Key: ARROW-5937
> URL: https://issues.apache.org/jira/browse/ARROW-5937
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-5940) [Release] Add support for re-uploading sign/checksum for binary artifacts

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5940:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Release] Add support for re-uploading sign/checksum for binary artifacts
> -
>
> Key: ARROW-5940
> URL: https://issues.apache.org/jira/browse/ARROW-5940
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-5941) [Release] Avoid re-uploading already uploaded binary artifacts

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5941:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Release] Avoid re-uploading already uploaded binary artifacts
> --
>
> Key: ARROW-5941
> URL: https://issues.apache.org/jira/browse/ARROW-5941
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-5958) [Python] Link zlib statically in the wheels

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5958:

Fix Version/s: 0.15.0

> [Python] Link zlib statically in the wheels
> ---
>
> Key: ARROW-5958
> URL: https://issues.apache.org/jira/browse/ARROW-5958
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 0.15.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Bundling dependencies statically is preferred over bundling as shared libs. 





[jira] [Resolved] (ARROW-5959) [C++][CI] Fuzzit does not know about branch + commit hash

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5959.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4952
[https://github.com/apache/arrow/pull/4952]

> [C++][CI] Fuzzit does not know about branch + commit hash
> -
>
> Key: ARROW-5959
> URL: https://issues.apache.org/jira/browse/ARROW-5959
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>  Labels: CI, fuzzer, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Reported 
> [here|https://github.com/apache/arrow/pull/4504#issuecomment-509932673], 
> fuzzit does not seem to retrieve the branch + commit hash, which is bad for 
> tracking.
> h2. AC
>  * Fix CI setup 
> ([hint|https://github.com/apache/arrow/pull/4504#issuecomment-510415931])
>  * Use {{set -euxo pipefail}} in 
> [{{docker_build_and_fuzzit.sh}}|https://github.com/apache/arrow/blob/master/ci/docker_build_and_fuzzit.sh] 
>  to prevent this issue in the future





[jira] [Resolved] (ARROW-6068) [Python] Hypothesis test failure

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6068.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4981
[https://github.com/apache/arrow/pull/4981]

> [Python] Hypothesis test failure
> 
>
> Key: ARROW-6068
> URL: https://issues.apache.org/jira/browse/ARROW-6068
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> {code:java}
> $ python -m pytest --hypothesis-seed=249088922892200171383018406164042644900 
> --hypothesis --tb=native pyarrow/tests/test_strategies.py 
> === 
> test session starts 
> 
> platform linux -- Python 3.7.3, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
> hypothesis profile 'dev' -> max_examples=10, 
> database=DirectoryBasedExampleDatabase('/home/antoine/arrow/dev/python/.hypothesis/examples')
> rootdir: /home/antoine/arrow/dev/python, inifile: setup.cfg
> plugins: timeout-1.3.3, repeat-0.8.0, hypothesis-3.82.1, forked-1.0.2, 
> xdist-1.28.0
> collected 7 items 
>   
>
> pyarrow/tests/test_strategies.py ..F  
>   
>  [100%]
> =
>  FAILURES 
> =
> ___
>  test_tables 
> 
> Traceback (most recent call last):
>   File "/home/antoine/arrow/dev/python/pyarrow/tests/test_strategies.py", 
> line 55, in test_tables
> def test_tables(table):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/hypothesis/core.py",
>  line 960, in wrapped_test
> raise the_error_hypothesis_found
>   File "/home/antoine/arrow/dev/python/pyarrow/tests/strategies.py", line 
> 249, in tables
> return pa.Table.from_arrays(children, schema=schema)
>   File "pyarrow/table.pxi", line 1018, in pyarrow.lib.Table.from_arrays
> return pyarrow_wrap_table(CTable.Make(c_schema, columns))
>   File "pyarrow/public-api.pxi", line 314, in pyarrow.lib.pyarrow_wrap_table
> check_status(ctable.get().Validate())
>   File "pyarrow/error.pxi", line 76, in pyarrow.lib.check_status
> raise ArrowInvalid(message)
> pyarrow.lib.ArrowInvalid: Column data for field 11 with type struct<: null, : 
> null, : null, : null, : null, : null> is inconsistent with schema struct<: 
> null, : null, : null, : null not null, : null, : null>
> {code}





[jira] [Updated] (ARROW-6068) [Python] Hypothesis test failure, Add StructType::Make that accepts vector of fields

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6068:

Summary: [Python] Hypothesis test failure, Add StructType::Make that 
accepts vector of fields  (was: [Python] Hypothesis test failure)

> [Python] Hypothesis test failure, Add StructType::Make that accepts vector of 
> fields
> 
>
> Key: ARROW-6068
> URL: https://issues.apache.org/jira/browse/ARROW-6068
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> {code:java}
> $ python -m pytest --hypothesis-seed=249088922892200171383018406164042644900 
> --hypothesis --tb=native pyarrow/tests/test_strategies.py 
> === 
> test session starts 
> 
> platform linux -- Python 3.7.3, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
> hypothesis profile 'dev' -> max_examples=10, 
> database=DirectoryBasedExampleDatabase('/home/antoine/arrow/dev/python/.hypothesis/examples')
> rootdir: /home/antoine/arrow/dev/python, inifile: setup.cfg
> plugins: timeout-1.3.3, repeat-0.8.0, hypothesis-3.82.1, forked-1.0.2, 
> xdist-1.28.0
> collected 7 items 
>   
>
> pyarrow/tests/test_strategies.py ..F  
>   
>  [100%]
> =
>  FAILURES 
> =
> ___
>  test_tables 
> 
> Traceback (most recent call last):
>   File "/home/antoine/arrow/dev/python/pyarrow/tests/test_strategies.py", 
> line 55, in test_tables
> def test_tables(table):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/hypothesis/core.py",
>  line 960, in wrapped_test
> raise the_error_hypothesis_found
>   File "/home/antoine/arrow/dev/python/pyarrow/tests/strategies.py", line 
> 249, in tables
> return pa.Table.from_arrays(children, schema=schema)
>   File "pyarrow/table.pxi", line 1018, in pyarrow.lib.Table.from_arrays
> return pyarrow_wrap_table(CTable.Make(c_schema, columns))
>   File "pyarrow/public-api.pxi", line 314, in pyarrow.lib.pyarrow_wrap_table
> check_status(ctable.get().Validate())
>   File "pyarrow/error.pxi", line 76, in pyarrow.lib.check_status
> raise ArrowInvalid(message)
> pyarrow.lib.ArrowInvalid: Column data for field 11 with type struct<: null, : 
> null, : null, : null, : null, : null> is inconsistent with schema struct<: 
> null, : null, : null, : null not null, : null, : null>
> {code}





[jira] [Updated] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6002:
--
Labels: pull-request-available  (was: )

> [C++][Gandiva] TestCastFunctions does not test int64 casting
> -
>
> Key: ARROW-6002
> URL: https://issues.apache.org/jira/browse/ARROW-6002
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Benjamin Kietzman
>Priority: Minor
>  Labels: pull-request-available
>
> {{outputs[2]}} (the cast from float32) is checked twice 
> (https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478) 
> while {{outputs[1]}} (the cast from int64) is never checked.
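>
> A hypothetical sketch of the fix pattern (the data and names here are 
> invented; the real assertions live in Gandiva's TestCastFunctions): iterating 
> over the outputs guarantees each one is checked exactly once, avoiding the 
> copy/paste mistake of asserting {{outputs[2]}} twice.
> {code:cpp}
> #include <cassert>
> #include <cstddef>
> #include <vector>
> 
> int main() {
>   // Stand-ins for the casted outputs and their expected values.
>   std::vector<int> outputs = {10, 20, 30};
>   std::vector<int> expected = {10, 20, 30};
>   assert(outputs.size() == expected.size());
>   // A loop checks outputs[0], outputs[1], and outputs[2] exactly once each.
>   for (std::size_t i = 0; i < outputs.size(); ++i) {
>     assert(outputs[i] == expected[i]);
>   }
>   return 0;
> }
> {code}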





[jira] [Created] (ARROW-6105) [C++][Parquet][Python] Add test case showing dictionary-encoded subfields in nested type

2019-08-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6105:
---

 Summary: [C++][Parquet][Python] Add test case showing 
dictionary-encoded subfields in nested type
 Key: ARROW-6105
 URL: https://issues.apache.org/jira/browse/ARROW-6105
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


As a follow-up to ARROW-6077: this is fixed, but not yet fully tested. To 
contain the scope of ARROW-6077, I will add a test as a follow-up.





[jira] [Created] (ARROW-6104) [Rust] [DataFusion] Don't allow bare_trait_objects

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6104:
-

 Summary: [Rust] [DataFusion] Don't allow bare_trait_objects
 Key: ARROW-6104
 URL: https://issues.apache.org/jira/browse/ARROW-6104
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


Need to remove "#![allow(bare_trait_objects)]" from cargo.toml and fix 
compiler warnings.





[jira] [Created] (ARROW-6103) [Java] Do we really want to use the maven release plugin?

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6103:
-

 Summary: [Java] Do we really want to use the maven release plugin?
 Key: ARROW-6103
 URL: https://issues.apache.org/jira/browse/ARROW-6103
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


For reference: I'm filing this issue to track investigation work around this.
{code:java}
The biggest problem for the Git commit is our Java package
requires "apache-arrow-${VERSION}" tag on
https://github.com/apache/arrow . (Right?)
I think that "mvn release:perform" in
dev/release/01-perform.sh does so but I don't know the
details of "mvn release:perform"...{code}





[jira] [Closed] (ARROW-6102) [Testing] Add partitioned CSV file to arrow-testing repo

2019-08-01 Thread Andy Grove (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-6102.
-
Resolution: Won't Fix

I will have the Rust unit tests dynamically create partitioned files instead

> [Testing] Add partitioned CSV file to arrow-testing repo
> 
>
> Key: ARROW-6102
> URL: https://issues.apache.org/jira/browse/ARROW-6102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I need to add a partitioned CSV file to arrow-testing for use in parallel 
> query unit tests in DataFusion





[GitHub] [arrow-testing] andygrove commented on issue #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
andygrove commented on issue #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7#issuecomment-517421424
 
 
   ok thanks @wesm I will do that ... works better for testing different numbers 
of partitions too




[GitHub] [arrow-testing] andygrove closed pull request #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
andygrove closed pull request #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7
 
 
   




[GitHub] [arrow-testing] wesm commented on issue #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
wesm commented on issue #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7#issuecomment-517420863
 
 
   Yeah, that's what we do in Python for our partitioned Parquet tests, for 
example




[GitHub] [arrow-testing] wesm edited a comment on issue #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
wesm edited a comment on issue #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7#issuecomment-517420863
 
 
   Yeah, that's what we do in Python for our partitioned Parquet tests, for 
example




[GitHub] [arrow-testing] andygrove commented on issue #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
andygrove commented on issue #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7#issuecomment-517419157
 
 
   Oh, you mean generate a partitioned version directly from the Rust unit 
test? I guess that could work.




[GitHub] [arrow-testing] andygrove commented on issue #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
andygrove commented on issue #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7#issuecomment-517418904
 
 
   My thinking was that I could run queries against the single file and the 
partitioned version and check that the results match




[GitHub] [arrow-testing] wesm commented on issue #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
wesm commented on issue #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7#issuecomment-517418210
 
 
   Wouldn't it be better to generate an example?




[jira] [Updated] (ARROW-6102) [Testing] Add partitioned CSV file to arrow-testing repo

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6102:
--
Labels: pull-request-available  (was: )

> [Testing] Add partitioned CSV file to arrow-testing repo
> 
>
> Key: ARROW-6102
> URL: https://issues.apache.org/jira/browse/ARROW-6102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> I need to add a partitioned CSV file to arrow-testing for use in parallel 
> query unit tests in DataFusion





[GitHub] [arrow-testing] andygrove opened a new pull request #7: ARROW-6102: Add partitioned CSV example

2019-08-01 Thread GitBox
andygrove opened a new pull request #7: ARROW-6102: Add partitioned CSV example
URL: https://github.com/apache/arrow-testing/pull/7
 
 
   




[jira] [Created] (ARROW-6102) [Testing] Add partitioned CSV file to arrow-testing repo

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6102:
-

 Summary: [Testing] Add partitioned CSV file to arrow-testing repo
 Key: ARROW-6102
 URL: https://issues.apache.org/jira/browse/ARROW-6102
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


I need to add a partitioned CSV file to arrow-testing for use in parallel query 
unit tests in DataFusion





[jira] [Updated] (ARROW-4224) [Python] Support integration with pydata/sparse library

2019-08-01 Thread Rok Mihevc (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-4224:
--
Labels: pull-request-available sparse  (was: sparse)

> [Python] Support integration with pydata/sparse library
> ---
>
> Key: ARROW-4224
> URL: https://issues.apache.org/jira/browse/ARROW-4224
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Kenta Murata
>Assignee: Rok Mihevc
>Priority: Minor
>  Labels: pull-request-available, sparse
>
> It would be great to support integration with pydata/sparse library.





[jira] [Closed] (ARROW-6098) [C++] Partially mitigating CPU scaling effects in benchmarks

2019-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6098.
---
Resolution: Not A Problem

Got it, thanks

> [C++] Partially mitigating CPU scaling effects in benchmarks
> 
>
> Key: ARROW-6098
> URL: https://issues.apache.org/jira/browse/ARROW-6098
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have a lot of benchmarks that return results based on a single iteration
> {code}
> (arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
> ./release/arrow-builder-benchmark --benchmark_filter=Dict
> 2019-08-01 10:46:03
> Running ./release/arrow-builder-benchmark
> Run on (12 X 4400 MHz CPU s)
> CPU Caches:
>   L1 Data 32K (x6)
>   L1 Instruction 32K (x6)
>   L2 Unified 256K (x6)
>   L3 Unified 12288K (x1)
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> ---
> BenchmarkTime   CPU Iterations
> ---
> BuildInt64DictionaryArrayRandom  622889286 ns  622864485 ns  1   
> 411.004MB/s
> BuildInt64DictionaryArraySequential  546764048 ns  545992395 ns  1   
> 468.871MB/s
> BuildInt64DictionaryArraySimilar 737759293 ns  737696850 ns  1   
> 347.026MB/s
> BuildStringDictionaryArray   985433473 ns  985363901 ns  1   
> 346.608MB/s
> (arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
> ./release/arrow-builder-benchmark --benchmark_filter=Dict
> 2019-08-01 10:46:09
> Running ./release/arrow-builder-benchmark
> Run on (12 X 4400 MHz CPU s)
> CPU Caches:
>   L1 Data 32K (x6)
>   L1 Instruction 32K (x6)
>   L2 Unified 256K (x6)
>   L3 Unified 12288K (x1)
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> ---
> BenchmarkTime   CPU Iterations
> ---
> BuildInt64DictionaryArrayRandom  527063570 ns  527044023 ns  1   
> 485.728MB/s
> BuildInt64DictionaryArraySequential  566285427 ns  566270336 ns  1   
> 452.081MB/s
> BuildInt64DictionaryArraySimilar 762954193 ns  762332297 ns  1   
> 335.812MB/s
> BuildStringDictionaryArray   991095766 ns  991018875 ns  1
> 344.63MB/s
> {code}
> I'm sure the result here is being heavily affected by CPU scaling, but I 
> think we can mitigate its impact by using `MinTime`. I find that adding 
> `MinTime(1.0)` to these particular benchmarks makes them more consistent.





[jira] [Commented] (ARROW-6098) [C++] Partially mitigating CPU scaling effects in benchmarks

2019-08-01 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898202#comment-16898202
 ] 

Francois Saint-Jacques commented on ARROW-6098:
---

Yes, I removed most of the static `MinTime` and `Repetitions` to let archery 
pass them as parameters. `--benchmark_min_time` and `--benchmark_repetitions` 
are the preferred way.

> [C++] Partially mitigating CPU scaling effects in benchmarks
> 
>
> Key: ARROW-6098
> URL: https://issues.apache.org/jira/browse/ARROW-6098
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have a lot of benchmarks that return results based on a single iteration
> {code}
> (arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
> ./release/arrow-builder-benchmark --benchmark_filter=Dict
> 2019-08-01 10:46:03
> Running ./release/arrow-builder-benchmark
> Run on (12 X 4400 MHz CPU s)
> CPU Caches:
>   L1 Data 32K (x6)
>   L1 Instruction 32K (x6)
>   L2 Unified 256K (x6)
>   L3 Unified 12288K (x1)
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> ---
> BenchmarkTime   CPU Iterations
> ---
> BuildInt64DictionaryArrayRandom  622889286 ns  622864485 ns  1   
> 411.004MB/s
> BuildInt64DictionaryArraySequential  546764048 ns  545992395 ns  1   
> 468.871MB/s
> BuildInt64DictionaryArraySimilar 737759293 ns  737696850 ns  1   
> 347.026MB/s
> BuildStringDictionaryArray   985433473 ns  985363901 ns  1   
> 346.608MB/s
> (arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
> ./release/arrow-builder-benchmark --benchmark_filter=Dict
> 2019-08-01 10:46:09
> Running ./release/arrow-builder-benchmark
> Run on (12 X 4400 MHz CPU s)
> CPU Caches:
>   L1 Data 32K (x6)
>   L1 Instruction 32K (x6)
>   L2 Unified 256K (x6)
>   L3 Unified 12288K (x1)
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> ---
> BenchmarkTime   CPU Iterations
> ---
> BuildInt64DictionaryArrayRandom  527063570 ns  527044023 ns  1   
> 485.728MB/s
> BuildInt64DictionaryArraySequential  566285427 ns  566270336 ns  1   
> 452.081MB/s
> BuildInt64DictionaryArraySimilar 762954193 ns  762332297 ns  1   
> 335.812MB/s
> BuildStringDictionaryArray   991095766 ns  991018875 ns  1
> 344.63MB/s
> {code}
> I'm sure the result here is being heavily affected by CPU scaling, but I 
> think we can mitigate its impact by using `MinTime`. I find that adding 
> `MinTime(1.0)` to these particular benchmarks makes them more consistent.





[jira] [Created] (ARROW-6101) [Rust] [DataFusion] Create physical plan from logical plan

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6101:
-

 Summary: [Rust] [DataFusion] Create physical plan from logical plan
 Key: ARROW-6101
 URL: https://issues.apache.org/jira/browse/ARROW-6101
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove
Assignee: Andy Grove


Once the physical plan is in place and can be executed, I will implement logic 
to convert the logical plan to a physical plan and remove the legacy code for 
directly executing a logical plan.





[jira] [Updated] (ARROW-6088) [Rust] [DataFusion] Implement parallel execution for projection

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6088:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Implement parallel execution for projection
> ---
>
> Key: ARROW-6088
> URL: https://issues.apache.org/jira/browse/ARROW-6088
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (ARROW-6100) [Rust] Pin to specific Rust nightly release

2019-08-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6100:
-

 Summary: [Rust] Pin to specific Rust nightly release
 Key: ARROW-6100
 URL: https://issues.apache.org/jira/browse/ARROW-6100
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0


Builds are currently non-deterministic because rust-toolchain contains 
"nightly", meaning "use the latest nightly release of Rust". This can cause 
seemingly random build failures in CI. I propose we modify rust-toolchain 
to refer to a specific nightly release, e.g. "nightly-2019-07-31", so that 
builds are deterministic.

We can update this nightly version when needed (e.g. to pick up new features) 
as part of the regular PR process.





[jira] [Updated] (ARROW-6099) [JAVA] Add the ability to not use the slf4j logging framework

2019-08-01 Thread Haowei Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haowei Yu updated ARROW-6099:
-
Description: 
Currently, the Java library calls the slf4j API directly, and there is no 
abstraction layer. This means users need to install slf4j as a requirement even 
if they don't use slf4j at all.

It would be best to change the slf4j dependency scope to "provided" and to log 
content only if an slf4j jar file is provided at runtime.

  was:
Currently, the java library directly calls slf4j api, and there is no abstract 
layer. This leads to user need to install slf4j as a requirement even if we 
don't use slf4j at all. 

 

It is best if you can change the slf4j dependency to provided and log content 
only if slf4j jar file is provided at runtime.


> [JAVA] Add the ability to not use the slf4j logging framework
> ---
>
> Key: ARROW-6099
> URL: https://issues.apache.org/jira/browse/ARROW-6099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Haowei Yu
>Priority: Major
>
> Currently, the Java library calls the slf4j API directly, and there is no 
> abstraction layer. This means users need to install slf4j as a requirement 
> even if they don't use slf4j at all.
>  
> It would be best to change the slf4j dependency scope to "provided" and to 
> log content only if an slf4j jar file is provided at runtime.





[jira] [Created] (ARROW-6099) [JAVA] Add the ability to not use the slf4j logging framework

2019-08-01 Thread Haowei Yu (JIRA)
Haowei Yu created ARROW-6099:


 Summary: [JAVA] Add the ability to not use the slf4j logging 
framework
 Key: ARROW-6099
 URL: https://issues.apache.org/jira/browse/ARROW-6099
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 0.14.1
Reporter: Haowei Yu


Currently, the Java library calls the slf4j API directly, and there is no 
abstraction layer. This means users need to install slf4j as a requirement even 
if they don't use slf4j at all.

It would be best to change the slf4j dependency to "provided" and to log content 
only if an slf4j jar file is provided at runtime.





[jira] [Updated] (ARROW-6096) [C++] Conditionally depend on boost regex library

2019-08-01 Thread Hatem Helal (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hatem Helal updated ARROW-6096:
---
Component/s: C++

> [C++] Conditionally depend on boost regex library
> -
>
> Key: ARROW-6096
> URL: https://issues.apache.org/jira/browse/ARROW-6096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There appears to be only one place where the boost regex library is used:
> [cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]
> I think this can be replaced by the C++11 regex library.
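>
> A minimal sketch of the suggested replacement, with a hypothetical pattern 
> (not the actual one in parquet/metadata.cc): matching a version string with 
> C++11 {{std::regex}} instead of {{boost::regex}}.
> {code:cpp}
> #include <iostream>
> #include <regex>
> #include <string>
> 
> int main() {
>   // Hypothetical created-by string; the real parsing is in parquet/metadata.cc.
>   std::string created_by = "parquet-cpp version 1.5.1 (build abcd)";
>   std::regex version_re(R"((\d+)\.(\d+)\.(\d+))");
>   std::smatch match;
>   if (std::regex_search(created_by, match, version_re)) {
>     std::cout << "major=" << match[1] << " minor=" << match[2]
>               << " patch=" << match[3] << std::endl;
>   }
>   return 0;
> }
> {code}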





[jira] [Commented] (ARROW-6098) [C++] Partially mitigating CPU scaling effects in benchmarks

2019-08-01 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898165#comment-16898165
 ] 

Antoine Pitrou commented on ARROW-6098:
---

I think you can simply pass {{--benchmark_min_time=1.0}} on the command line.

> [C++] Partially mitigating CPU scaling effects in benchmarks
> 
>
> Key: ARROW-6098
> URL: https://issues.apache.org/jira/browse/ARROW-6098
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have a lot of benchmarks that return results based on a single iteration
> {code}
> (arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
> ./release/arrow-builder-benchmark --benchmark_filter=Dict
> 2019-08-01 10:46:03
> Running ./release/arrow-builder-benchmark
> Run on (12 X 4400 MHz CPU s)
> CPU Caches:
>   L1 Data 32K (x6)
>   L1 Instruction 32K (x6)
>   L2 Unified 256K (x6)
>   L3 Unified 12288K (x1)
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> ---
> BenchmarkTime   CPU Iterations
> ---
> BuildInt64DictionaryArrayRandom  622889286 ns  622864485 ns  1   
> 411.004MB/s
> BuildInt64DictionaryArraySequential  546764048 ns  545992395 ns  1   
> 468.871MB/s
> BuildInt64DictionaryArraySimilar 737759293 ns  737696850 ns  1   
> 347.026MB/s
> BuildStringDictionaryArray   985433473 ns  985363901 ns  1   
> 346.608MB/s
> (arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
> ./release/arrow-builder-benchmark --benchmark_filter=Dict
> 2019-08-01 10:46:09
> Running ./release/arrow-builder-benchmark
> Run on (12 X 4400 MHz CPU s)
> CPU Caches:
>   L1 Data 32K (x6)
>   L1 Instruction 32K (x6)
>   L2 Unified 256K (x6)
>   L3 Unified 12288K (x1)
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements 
> may be noisy and will incur extra overhead.
> ---
> BenchmarkTime   CPU Iterations
> ---
> BuildInt64DictionaryArrayRandom  527063570 ns  527044023 ns  1   
> 485.728MB/s
> BuildInt64DictionaryArraySequential  566285427 ns  566270336 ns  1   
> 452.081MB/s
> BuildInt64DictionaryArraySimilar 762954193 ns  762332297 ns  1   
> 335.812MB/s
> BuildStringDictionaryArray   991095766 ns  991018875 ns  1
> 344.63MB/s
> {code}
> I'm sure the result here is being heavily affected by CPU scaling, but I 
> think we can mitigate its impact by using `MinTime`. I find that adding 
> `MinTime(1.0)` to these particular benchmarks makes them more consistent.





[jira] [Created] (ARROW-6098) [C++] Partially mitigating CPU scaling effects in benchmarks

2019-08-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6098:
---

 Summary: [C++] Partially mitigating CPU scaling effects in 
benchmarks
 Key: ARROW-6098
 URL: https://issues.apache.org/jira/browse/ARROW-6098
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We have a lot of benchmarks that return results based on a single iteration


{code}
(arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
./release/arrow-builder-benchmark --benchmark_filter=Dict
2019-08-01 10:46:03
Running ./release/arrow-builder-benchmark
Run on (12 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may 
be noisy and will incur extra overhead.
---
BenchmarkTime   CPU Iterations
---
BuildInt64DictionaryArrayRandom  622889286 ns  622864485 ns  1   
411.004MB/s
BuildInt64DictionaryArraySequential  546764048 ns  545992395 ns  1   
468.871MB/s
BuildInt64DictionaryArraySimilar 737759293 ns  737696850 ns  1   
347.026MB/s
BuildStringDictionaryArray   985433473 ns  985363901 ns  1   
346.608MB/s
(arrow-3.7) 10:46 ~/code/arrow/cpp/build  (master)$ 
./release/arrow-builder-benchmark --benchmark_filter=Dict
2019-08-01 10:46:09
Running ./release/arrow-builder-benchmark
Run on (12 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may 
be noisy and will incur extra overhead.
---
BenchmarkTime   CPU Iterations
---
BuildInt64DictionaryArrayRandom  527063570 ns  527044023 ns  1   
485.728MB/s
BuildInt64DictionaryArraySequential  566285427 ns  566270336 ns  1   
452.081MB/s
BuildInt64DictionaryArraySimilar 762954193 ns  762332297 ns  1   
335.812MB/s
BuildStringDictionaryArray   991095766 ns  991018875 ns  1
344.63MB/s
{code}

I'm sure the result here is being heavily affected by CPU scaling, but I think 
we can mitigate its impact by using `MinTime`. I find that adding `MinTime(1.0)` 
to these particular benchmarks makes them more consistent.
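
A minimal sketch of the proposed mitigation, assuming Google Benchmark and an 
invented toy workload (not one of the builder benchmarks above): `MinTime(1.0)` 
keeps the harness iterating for at least one second, so the reported numbers 
average over many iterations instead of one.
{code:cpp}
#include <benchmark/benchmark.h>

static void BM_SumLoop(benchmark::State& state) {
  for (auto _ : state) {
    long sum = 0;
    for (long i = 0; i < 1000; ++i) sum += i;
    benchmark::DoNotOptimize(sum);
  }
}
// Run for at least 1 second rather than stopping after a single iteration.
BENCHMARK(BM_SumLoop)->MinTime(1.0);

BENCHMARK_MAIN();
{code}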





[jira] [Updated] (ARROW-5414) [C++] Using "Ninja" build system generator overrides default Release build type on Windows

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5414:
--
Labels: pull-request-available  (was: )

> [C++] Using "Ninja" build system generator overrides default Release build 
> type on Windows
> --
>
> Key: ARROW-5414
> URL: https://issues.apache.org/jira/browse/ARROW-5414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
>
> Ran into this infuriating issue today. See gist
> https://gist.github.com/wesm/c3dd87279ec20b2f2d12665fd264bfef
> The cmake invocation that produces this is
> {code}
> cmake -G "Ninja" ^
>   -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
>   -DARROW_BUILD_TESTS=on ^
>   -DARROW_CXXFLAGS="/WX /MP" ^
>   -DARROW_GANDIVA=on ^
>   -DARROW_PARQUET=on ^
>   -DARROW_PYTHON=on ^
>   ..
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6096) [C++] Conditionally depend on boost regex library

2019-08-01 Thread Hatem Helal (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hatem Helal updated ARROW-6096:
---
Summary: [C++] Conditionally depend on boost regex library  (was: [C++] 
Remove dependency on boost regex library)

I made an initial stab at this in the linked PR.  

> [C++] Conditionally depend on boost regex library
> -
>
> Key: ARROW-6096
> URL: https://issues.apache.org/jira/browse/ARROW-6096
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There appears to be only one place where the boost regex library is used:
> [cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]
> I think this can be replaced by the C++11 regex library.





[jira] [Updated] (ARROW-6096) [C++] Remove dependency on boost regex library

2019-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6096:
--
Labels: pull-request-available  (was: )

> [C++] Remove dependency on boost regex library
> --
>
> Key: ARROW-6096
> URL: https://issues.apache.org/jira/browse/ARROW-6096
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
>
> There appears to be only one place where the boost regex library is used:
> [cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]
> I think this can be replaced by the C++11 regex library.





[jira] [Updated] (ARROW-6061) [C++] Cannot build libarrow without rapidjson

2019-08-01 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6061:
--
Component/s: C++

> [C++] Cannot build libarrow without rapidjson
> -
>
> Key: ARROW-6061
> URL: https://issues.apache.org/jira/browse/ARROW-6061
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
>  
> {code:java}
> arrow/cpp/src/arrow/json/chunker.cc:25:30:fatal error: rapidjson/reader.h: 
> No such file or directory
>  #include "rapidjson/reader.h"
> compilation terminated.{code}
>  





[jira] [Resolved] (ARROW-6061) [C++] Cannot build libarrow without rapidjson

2019-08-01 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6061.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4962
[https://github.com/apache/arrow/pull/4962]

> [C++] Cannot build libarrow without rapidjson
> -
>
> Key: ARROW-6061
> URL: https://issues.apache.org/jira/browse/ARROW-6061
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
>  
> {code:java}
> arrow/cpp/src/arrow/json/chunker.cc:25:30:fatal error: rapidjson/reader.h: 
> No such file or directory
>  #include "rapidjson/reader.h"
> compilation terminated.{code}
>  




