[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300939#comment-16300939
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

jacques-n commented on issue #1439: ARROW-1943: [JAVA] handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439#issuecomment-353517425
 
 
   I'm +1 on this approach. It may not be perfect, but it is definitely far 
better than the old approach.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Handle setInitialCapacity() for deeply nested lists of lists
> 
>
> Key: ARROW-1943
> URL: https://issues.apache.org/jira/browse/ARROW-1943
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>  Labels: pull-request-available
>
> The current implementation of setInitialCapacity() uses a factor of 5 for 
> every level of list nesting.
> So if the schema is LIST(LIST(LIST(LIST(LIST(LIST(LIST(BIGINT))))))) and we 
> start with an initial capacity of 128, we end up throwing an 
> OversizedAllocationException from the BigIntVector, because at every level we 
> increased the capacity by a factor of 5, and by the time we reached the inner 
> scalar vector that actually stores the data, we were well over the max size 
> limit per vector (1MB).
> We saw this problem in Dremio when we failed to read deeply nested JSON data.
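The blow-up described above can be checked with simple arithmetic. This is a sketch, assuming the growth factor of 5 and starting capacity of 128 from the report, 7 levels of list nesting, and 8 bytes per BIGINT value:

```python
# Sketch of how a per-level growth factor of 5 overflows a 1 MB vector cap.
# The multiplier (5), initial capacity (128), and 7 levels of nesting are
# taken from the issue description; BIGINT values are 8 bytes each.
INITIAL_CAPACITY = 128
GROWTH_FACTOR = 5
NESTING_LEVELS = 7
BIGINT_BYTES = 8
MAX_VECTOR_BYTES = 1 << 20  # 1 MB limit per vector

capacity = INITIAL_CAPACITY
for _ in range(NESTING_LEVELS):
    capacity *= GROWTH_FACTOR  # each list level multiplies the estimate

inner_bytes = capacity * BIGINT_BYTES
print(capacity)     # 10000000 value slots at the innermost level
print(inner_bytes)  # 80000000 bytes, far beyond the 1 MB cap
print(inner_bytes > MAX_VECTOR_BYTES)  # True
```

So an innocuous initial capacity of 128 becomes 10 million slots (80 MB for BIGINT) by the innermost vector, which explains the OversizedAllocationException.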



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)



[jira] [Commented] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300872#comment-16300872
 ] 

ASF GitHub Bot commented on ARROW-1943:
---

siddharthteotia commented on issue #1439: ARROW-1943: [JAVA] handle 
setInitialCapacity for deeply nested lists
URL: https://github.com/apache/arrow/pull/1439#issuecomment-353504925
 
 
   Ping.
   




[jira] [Resolved] (ARROW-1944) FindArrow has wrong ARROW_STATIC_LIB

2017-12-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1944.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1440
[https://github.com/apache/arrow/pull/1440]

> FindArrow has wrong ARROW_STATIC_LIB
> 
>
> Key: ARROW-1944
> URL: https://issues.apache.org/jira/browse/ARROW-1944
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Philipp Moritz
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It seems that in
> https://github.com/apache/arrow/blob/a0555c04dd5c43230a1c50d0d0a94e06d8ad9ff0/cpp/cmake_modules/FindArrow.cmake#L100
> ARROW_PYTHON_LIB_PATH should be replaced with ARROW_LIBS.





[jira] [Commented] (ARROW-1944) FindArrow has wrong ARROW_STATIC_LIB

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300704#comment-16300704
 ] 

ASF GitHub Bot commented on ARROW-1944:
---

wesm closed pull request #1440: ARROW-1944: [C++] Fix ARROW_STATIC_LIB in 
FindArrow
URL: https://github.com/apache/arrow/pull/1440
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/cpp/cmake_modules/FindArrow.cmake 
b/cpp/cmake_modules/FindArrow.cmake
index 12f76b6c2..bce4404a4 100644
--- a/cpp/cmake_modules/FindArrow.cmake
+++ b/cpp/cmake_modules/FindArrow.cmake
@@ -97,8 +97,8 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIBS)
 set(ARROW_SHARED_IMP_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}.lib)
 set(ARROW_PYTHON_SHARED_IMP_LIB 
${ARROW_PYTHON_LIBS}/${ARROW_PYTHON_LIB_NAME}.lib)
   else()
-set(ARROW_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/lib${ARROW_LIB_NAME}.a)
-set(ARROW_PYTHON_STATIC_LIB 
${ARROW_PYTHON_LIB_PATH}/lib${ARROW_PYTHON_LIB_NAME}.a)
+set(ARROW_STATIC_LIB ${ARROW_LIBS}/lib${ARROW_LIB_NAME}.a)
+set(ARROW_PYTHON_STATIC_LIB ${ARROW_LIBS}/lib${ARROW_PYTHON_LIB_NAME}.a)
 
 set(ARROW_SHARED_LIB 
${ARROW_LIBS}/lib${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX})
 set(ARROW_PYTHON_SHARED_LIB 
${ARROW_LIBS}/lib${ARROW_PYTHON_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX})


 




[jira] [Updated] (ARROW-1938) Error writing to partitioned dataset

2017-12-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1938:

Fix Version/s: 0.9.0

> Error writing to partitioned dataset
> 
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
> Fix For: 0.9.0
>
> Attachments: pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> {code}
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), 
> '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 
> 'hour'], **write_table_values)
> {code}
> I've also tried write_table_values = {'chunk_size': 1} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.





[jira] [Commented] (ARROW-1938) Error writing to partitioned dataset

2017-12-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300700#comment-16300700
 ] 

Wes McKinney commented on ARROW-1938:
-

Marked for 0.9.0. Anything you can do to help us diagnose or reproduce this 
problem would be great.



[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2017-12-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300696#comment-16300696
 ] 

Wes McKinney commented on ARROW-1940:
-

Marked for 0.9.0 -- we can look into it. I will say that where bytes -> unicode 
promotions are occurring, it may be challenging to preserve a perfect round 
trip on Python 2 in all cases.

> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> ---
>
> Key: ARROW-1940
> URL: https://issues.apache.org/jira/browse/ARROW-1940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Dima Ryazanov
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, 
> "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table for the 
> second time, the metadata gets an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, 
> "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> So specifically, it appears that dataframe->table->dataframe->table 
> conversion produces a different result from just dataframe->table - which I 
> think is unexpected.
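The stability property the failing unit test relies on can be stated as a plain comparison of the two metadata payloads quoted above. A minimal pure-Python sketch (using only the JSON fragments from the report, not pyarrow itself):

```python
import json

# The two `column_indexes` payloads quoted above, parsed from JSON. The point
# of the report is that one conversion (DataFrame -> Table) and a repeated
# round trip (DataFrame -> Table -> DataFrame -> Table) should yield the
# same metadata, but they do not.
first_pass = json.loads(
    '[{"metadata": null, "field_name": null, "name": null, '
    '"numpy_type": "object", "pandas_type": "bytes"}]'
)
second_pass = json.loads(
    '[{"metadata": {"encoding": "UTF-8"}, "field_name": null, '
    '"name": null, "numpy_type": "object", "pandas_type": "unicode"}]'
)

# Collect the fields that changed between the two passes:
diff = {
    key: (first_pass[0][key], second_pass[0][key])
    for key in first_pass[0]
    if first_pass[0][key] != second_pass[0][key]
}
print(diff)  # {'metadata': (None, {'encoding': 'UTF-8'}), 'pandas_type': ('bytes', 'unicode')}
```

The diff makes the bug concrete: the second conversion both adds an `encoding` entry and promotes `pandas_type` from `bytes` to `unicode`, which is exactly the bytes -> unicode promotion mentioned in the comment above.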





[jira] [Updated] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2017-12-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1940:

Fix Version/s: 0.9.0



[jira] [Resolved] (ARROW-1931) [C++] w4996 warning due to std::tr1 failing builds on Visual Studio 2017

2017-12-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1931.
-
Resolution: Fixed

Issue resolved by pull request 1433
[https://github.com/apache/arrow/pull/1433]

> [C++] w4996 warning due to std::tr1 failing builds on Visual Studio 2017
> 
>
> Key: ARROW-1931
> URL: https://issues.apache.org/jira/browse/ARROW-1931
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> See for example. [~Max Risuhin] do you know what is the most appropriate fix 
> (besides silencing the deprecation warning)?
> {code}
> C:\projects\arrow\cpp\build\googletest_ep-prefix\src\googletest_ep\googletest\include\gtest/internal/gtest-port.h(996):
>  warning C4996: 'std::tr1': warning STL4002: The non-Standard std::tr1 
> namespace and TR1-only machinery are deprecated and will be REMOVED. You can 
> define _SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING to acknowledge that you 
> have received this warning.
> {code}





[jira] [Commented] (ARROW-1931) [C++] w4996 warning due to std::tr1 failing builds on Visual Studio 2017

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300645#comment-16300645
 ] 

ASF GitHub Bot commented on ARROW-1931:
---

wesm closed pull request #1433: ARROW-1931: [C++] Suppress C4996 deprecation 
warning in MSVC builds for now
URL: https://github.com/apache/arrow/pull/1433
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/appveyor.yml b/appveyor.yml
index e647b8b77..ea7922bf6 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -20,16 +20,13 @@ os: Visual Studio 2015
 
 environment:
   matrix:
-- JOB: "Cmake_Script_Tests"
-  GENERATOR: NMake Makefiles
-  PYTHON: "3.5"
-  ARCH: "64"
-  CONFIGURATION: "Release"
 - JOB: "Build"
-  GENERATOR: NMake Makefiles
+  GENERATOR: Visual Studio 15 2017 Win64
   PYTHON: "3.5"
   ARCH: "64"
   CONFIGURATION: "Release"
+  APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017
+  BOOST_ROOT: C:\Libraries\boost_1_64_0
 - JOB: "Build_Debug"
   GENERATOR: Visual Studio 14 2015 Win64
   PYTHON: "3.5"
@@ -49,13 +46,16 @@ environment:
   PYTHON: "3.5"
   ARCH: "64"
   CONFIGURATION: "Release"
+- JOB: "Cmake_Script_Tests"
+  GENERATOR: NMake Makefiles
+  PYTHON: "3.5"
+  ARCH: "64"
+  CONFIGURATION: "Release"
 - JOB: "Build"
-  GENERATOR: Visual Studio 15 2017 Win64
+  GENERATOR: NMake Makefiles
   PYTHON: "3.5"
   ARCH: "64"
   CONFIGURATION: "Release"
-  APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017
-  BOOST_ROOT: C:\Libraries\boost_1_64_0
 
   MSVC_DEFAULT_OPTIONS: ON
   BOOST_ROOT: C:\Libraries\boost_1_63_0
diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake 
b/cpp/cmake_modules/SetupCxxFlags.cmake
index 4e0ace0ba..97aed6b27 100644
--- a/cpp/cmake_modules/SetupCxxFlags.cmake
+++ b/cpp/cmake_modules/SetupCxxFlags.cmake
@@ -34,6 +34,14 @@ if (MSVC)
   # headers will see dllimport
   add_definitions(-DARROW_EXPORTING)
 
+  # ARROW-1931 See https://github.com/google/googletest/issues/1318
+  #
+  # This is added to CMAKE_CXX_FLAGS instead of CXX_COMMON_FLAGS since only the
+  # former is passed into the external projects
+  if (MSVC_VERSION VERSION_GREATER 1900)
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} 
/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING")
+  endif()
+
   if (CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
 # clang-cl
 set(CXX_COMMON_FLAGS "-EHsc")
@@ -56,6 +64,9 @@ if (MSVC)
   string(REPLACE "/MD" "-MT" ${c_flag} "${${c_flag}}")
 endforeach()
   endif()
+
+  # Support large object code
+  set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /bigobj")
 else()
   # Common flags set below with warning level
   set(CXX_COMMON_FLAGS "")


 




[jira] [Updated] (ARROW-1941) Table <-> DataFrame roundtrip failing

2017-12-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1941:

Fix Version/s: 0.9.0

> Table <-> DataFrame roundtrip failing
> -
>
> Key: ARROW-1941
> URL: https://issues.apache.org/jira/browse/ARROW-1941
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Thomas Buhrmann
> Fix For: 0.9.0
>
>
> Although it is possible to create an Arrow table with a column containing 
> only empty lists (cast to a particular type, e.g. string), in a roundtrip 
> through pandas the original type is lost, it seems, and subsequently attempts 
> to convert to pandas then fail.
> To reproduce in PyArrow 0.8.0:
> {code}
> import pyarrow as pa
> # Create table with array of empty lists, forced to have type list(string)
> arrays = {
> 'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
> 'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
> }
> rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
> tbl = pa.Table.from_batches([rb])
> print("Schema 1 (correct):\n{}".format(tbl.schema))
> # First roundtrip changes schema
> df = tbl.to_pandas()
> tbl2 = pa.Table.from_pandas(df)
> print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))
> # Second roundtrip explodes
> df2 = tbl2.to_pandas()
> {code}
> This results in the following output:
> {code}
> Schema 1 (correct):
> c1: list
>   child 0, item: string
> c2: list
>   child 0, item: string
> Schema 2 (wrong):
> c1: list
>   child 0, item: string
> c2: list
>   child 0, item: null
> __index_level_0__: int64
> metadata
> 
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": 
> [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", 
> "numpy_'
> b'type": "object", "metadata": {"encoding": "UTF-8"}}], 
> "columns":'
> b' [{"name": "c1", "field_name": "c1", "pandas_type": 
> "list[unicod'
> b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", 
> "'
> b'field_name": "c2", "pandas_type": "list[float64]", 
> "numpy_type":'
> b' "object", "metadata": null}, {"name": null, "field_name": 
> "__in'
> b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", 
> "'
> b'metadata": null}], "pandas_version": "0.21.1"}'}
> ...
> > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
> > null
> {code}
> I.e., the array of empty lists of strings gets converted into an array of 
> lists of type null, and in the pandas schema to lists of type float64.
> If one changes the empty lists to values of None in the creation of the 
> record batches, the roundtrip doesn't explode, but it will silently convert 
> the column to a simple column of type float (i.e. I lose the list type) in 
> pandas. This doesn't help, since other batches from the same source might 
> have non-empty lists and would end up with a different inferred schema, and 
> so can't be concatenated into a single table.
> (If this attempt at a double roundtrip seems weird, in my use case I receive 
> data from a server in RecordBatches, which I convert to pandas for 
> manipulation. I then serialize this data to disk using Arrow, and later need 
> to read it back into pandas again for further manipulation. So I need to be 
> able to go through various rounds of table->df->table->df->table etc., where 
> at any time a record batch may have columns that contain only empty lists).
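The failure mode described above is ordinary type inference with no evidence: an empty list constrains nothing, so the element type falls back to a default (null on the Arrow side, float64 on the pandas side). A minimal pure-Python sketch of that inference gap (illustrative only; the function name and logic are hypothetical, not pyarrow's actual algorithm):

```python
def infer_element_type(column):
    """Infer a list column's element type from observed values only."""
    # Gather the type names of every element actually seen in any cell.
    seen = {type(item).__name__
            for cell in column if cell is not None
            for item in cell}
    if not seen:
        # No elements observed: there is no evidence to infer from, so the
        # inferred type degrades to a default -- the schema information that
        # was explicitly attached to the original table is lost.
        return "null"
    if len(seen) == 1:
        return seen.pop()
    return "mixed"

print(infer_element_type([["test"], ["a", "b"], None]))  # 'str'  -- evidence exists
print(infer_element_type([[], [], []]))                  # 'null' -- schema lost
```

This is why batches that happen to contain only empty lists end up with a different inferred schema than batches with data, and can no longer be concatenated into one table.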





[jira] [Commented] (ARROW-1931) [C++] w4996 warning due to std::tr1 failing builds on Visual Studio 2017

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300254#comment-16300254
 ] 

ASF GitHub Bot commented on ARROW-1931:
---

wesm commented on issue #1433: ARROW-1931: [C++] Suppress C4996 deprecation 
warning in MSVC builds for now
URL: https://github.com/apache/arrow/pull/1433#issuecomment-353397848
 
 
   Thanks @MaxRis. I just pushed changes and will merge if the build passes.




[jira] [Commented] (ARROW-1931) [C++] w4996 warning due to std::tr1 failing builds on Visual Studio 2017

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300246#comment-16300246
 ] 

ASF GitHub Bot commented on ARROW-1931:
---

MaxRis commented on issue #1433: ARROW-1931: [C++] Suppress C4996 deprecation 
warning in MSVC builds for now
URL: https://github.com/apache/arrow/pull/1433#issuecomment-353396851
 
 
   @wesm 
[this](https://github.com/MaxRis/arrow/commit/cfcd9c224aea495c2414a161ac6242bb59bf00d1)
 seems like it should work fine.
   The VS 2017 build passed there: 
https://ci.appveyor.com/project/MaxRisuhin/arrow (I've temporarily moved the 
VS2017 AppVeyor build job to first place, and some logging should still be 
removed from my changes.)




[jira] [Commented] (ARROW-1931) [C++] w4996 warning due to std::tr1 failing builds on Visual Studio 2017

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299701#comment-16299701
 ] 

ASF GitHub Bot commented on ARROW-1931:
---

MaxRis commented on issue #1433: ARROW-1931: [C++] Suppress C4996 deprecation 
warning in MSVC builds for now
URL: https://github.com/apache/arrow/pull/1433#issuecomment-353288318
 
 
   @wesm sure

