[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2019-11-18 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977167#comment-16977167
 ] 

Micah Kornfield commented on ARROW-4890:


I agree, it should probably be on the Spark side (assuming the root cause is 
hitting caps in Arrow).
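
A minimal sketch of one way to check that hypothesis (assumptions: each group 
is sent to the Python worker as a single Arrow record batch, and the relevant 
cap is the 2 GiB signed 32-bit length limit), estimating the size of the 
largest group before applying the GROUPED_MAP UDF:
{code:java}
import pandas as pd

# Toy stand-in for the real data; column names follow the snippet below.
pdf = pd.DataFrame({'df1_c1': [1234567] * 1000,
                    'payload': ['abcdefghijklmno'] * 1000})

# Average bytes per row, measured deeply (string contents included).
row_bytes = pdf.memory_usage(index=False, deep=True).sum() / len(pdf)

# Approximate size of the largest group vs. the 2 GiB limit.
largest = pdf.groupby('df1_c1').size().max() * row_bytes
print(largest, largest > 2**31 - 1)
{code}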

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>Reporter: Abdeali Kothari
>Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in Arrow project as the traceback seems to suggest this is an 
> issue in Arrow.
>  Continuation from the conversation on the 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in 
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as the size of the dataset I want to group on starts increasing. Here is a 
> code snippet with which I can reproduce this.
>  Note: My actual dataset is much larger and has many more unique IDs; it is a 
> valid use case where I cannot simplify this groupby in any way. I have 
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>   columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
>   columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
>     return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per 
> executor too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-11-18 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-6110:
--

Assignee: (was: Micah Kornfield)

> [Java] Support LargeList Type and add integration test with C++
> ---
>
> Key: ARROW-6110
> URL: https://issues.apache.org/jira/browse/ARROW-6110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4193) [Rust] Add support for decimal data type

2019-11-18 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-4193:
--
Parent: ARROW-3690
Issue Type: Sub-task  (was: Improvement)

> [Rust] Add support for decimal data type
> 
>
> Key: ARROW-4193
> URL: https://issues.apache.org/jira/browse/ARROW-4193
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Priority: Minor
>  Labels: beginner
> Fix For: 1.0.0
>
>
> We should add {{Decimal(usize,usize)}} to DataType and add the corresponding 
> array and builder classes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7035) [R] Default arguments are unclear in write_parquet docs

2019-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7035:
--
Labels: documentation pull-request-available  (was: documentation)

> [R] Default arguments are unclear in write_parquet docs
> ---
>
> Key: ARROW-7035
> URL: https://issues.apache.org/jira/browse/ARROW-7035
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.15.0
> Environment: Ubuntu with libparquet-dev 0.15.0-1, R 3.6.1, and arrow 
> 0.15.0.
>Reporter: Karl Dunkle Werner
>Priority: Minor
>  Labels: documentation, pull-request-available
> Fix For: 1.0.0
>
>
> Thank you so much for adding support for reading and writing parquet files in 
> R! I have a few questions about the user interface and optional arguments, 
> but I want to highlight how great it is to have this useful filetype to pass 
> data back and forth.
> The defaults for the optional arguments in {{arrow::write_parquet}} aren't 
> always clear. Here were my questions after reading the help docs from 
> {{write_parquet}}:
>  * What's the default {{version}}? Should a user prefer "2.0" for new 
> projects?
>  * What are acceptable values for {{compression}}? (Answer: {{uncompressed}}, 
> {{snappy}}, {{gzip}}, {{brotli}}, {{zstd}}, or {{lz4}}.)
>  * What's the default for {{use_dictionary}}? Seems to be {{TRUE}}, at least 
> some of the time.
>  * What's the default for {{write_statistics}}? Should a user prefer {{TRUE}}?
>  * Can I assume {{allow_truncated_timestamps}} is {{FALSE}} by default?
> As someone who works in both R and Python, I was a little surprised that 
> pyarrow uses snappy compression by default while R's default is uncompressed. 
> My preference would be to have the same default arguments, but that might be a 
> fringe use case.
> While I was digging into this, I was surprised that 
> {{ParquetReaderProperties}} is exported and documented, but 
> {{ParquetWriterProperties}} isn't. Is that intentional?
> Thanks!
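
For comparison, a quick sketch of the pyarrow writer defaults mentioned above 
({{pyarrow.parquet.write_table}} is the Python counterpart):
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'x': [1, 2, 3]})

# Defaults: version='1.0', use_dictionary=True, compression='snappy'.
pq.write_table(table, 'defaults.parquet')

# The knobs the questions above correspond to, set explicitly.
pq.write_table(table, 'tuned.parquet', version='2.0',
               use_dictionary=True, compression='gzip')
{code}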



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7147) [C++][Dataset] Refactor dataset's API to use Result

2019-11-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-7147.
---
Resolution: Fixed

> [C++][Dataset] Refactor dataset's API to use Result
> --
>
> Key: ARROW-7147
> URL: https://issues.apache.org/jira/browse/ARROW-7147
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> We should make this switch before the API settles



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7188) [C++][Doc] doxygen broken on master: missing param implicit_casts

2019-11-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-7188.
---
Resolution: Fixed

> [C++][Doc] doxygen broken on master: missing param implicit_casts
> -
>
> Key: ARROW-7188
> URL: https://issues.apache.org/jira/browse/ARROW-7188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset, Documentation
>Reporter: Neal Richardson
>Assignee: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://circleci.com/gh/ursa-labs/crossbow/4991



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7094) [C++] FileSystemDataSource should use an owning pointer for fs::Filesystem

2019-11-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-7094.
---
Resolution: Fixed

> [C++] FileSystemDataSource should use an owning pointer for fs::Filesystem
> --
>
> Key: ARROW-7094
> URL: https://issues.apache.org/jira/browse/ARROW-7094
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-6340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7204) [C++][Dataset] In expression should not require exact type match

2019-11-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7204:
--

 Summary: [C++][Dataset] In expression should not require exact 
type match
 Key: ARROW-7204
 URL: https://issues.apache.org/jira/browse/ARROW-7204
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Neal Richardson
Assignee: Ben Kietzman
 Fix For: 1.0.0


Similar to ARROW-7047. I encountered this on ARROW-7185 
(https://github.com/apache/arrow/pull/5858/files#diff-1d8a97ca966e8446ef2ae4b7b5a96ed1R125)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7185) [R][Dataset] Add bindings for IN, IS_VALID expressions

2019-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7185:
--
Labels: pull-request-available  (was: )

> [R][Dataset] Add bindings for IN, IS_VALID expressions
> --
>
> Key: ARROW-7185
> URL: https://issues.apache.org/jira/browse/ARROW-7185
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7148) [C++][Dataset] API cleanup

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7148.

Resolution: Fixed

Issue resolved by pull request 5857
[https://github.com/apache/arrow/pull/5857]

> [C++][Dataset] API cleanup
> --
>
> Key: ARROW-7148
> URL: https://issues.apache.org/jira/browse/ARROW-7148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6960) [R] Add support for more compression codecs in Windows build

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6960:
--

Assignee: Grant Nguyen

> [R] Add support for more compression codecs in Windows build
> 
>
> Key: ARROW-6960
> URL: https://issues.apache.org/jira/browse/ARROW-6960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.15.0
> Environment: Windows 10
>Reporter: Grant Nguyen
>Assignee: Grant Nguyen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When I attempt to write a parquet file using lz4, zstd, or brotli compression 
> using R arrow 0.15.0, I am unable to do so due to the codec support not being 
> built (example below).
>  
> {code:java}
> > arrow::write_parquet(payout_strategy, sink = 
> > "records_test_lz4.parquet",compression = "lz4")
> Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
>  Arrow error: IOError: Arrow error: NotImplemented: LZ4 codec support not 
> built{code}
>  
> I believe that the error is generated through 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L124-L145],
>  but I am not sure how to call 
> {code:java}
> install.packages("arrow"){code}
> in R to enable the ARROW_WITH_ZSTD/LZ4/BROTLI flags, or whether I should be 
> installing zstd separately from arrow and then doing something pre- or 
> post-install to link zstd with arrow. From 
> [https://github.com/apache/arrow/issues/1209], it appears that zstd support 
> has been added to arrow and parquet in general, and the R package readme 
> ([https://github.com/apache/arrow/tree/master/r]) 
>  notes "On macOS and Windows, installing a binary package from CRAN will 
> handle Arrow's C++ dependencies for you", but I get the sense that does not 
> apply to zstd.
>  
> Is there guidance as to how to enable zstd and other compression codecs prior 
> to or after downloading the R arrow package? Could this be added to the R 
> documentation somewhere for future reference?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6960) [R] Add support for more compression codecs in Windows build

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6960.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5814
[https://github.com/apache/arrow/pull/5814]

> [R] Add support for more compression codecs in Windows build
> 
>
> Key: ARROW-6960
> URL: https://issues.apache.org/jira/browse/ARROW-6960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.15.0
> Environment: Windows 10
>Reporter: Grant Nguyen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When I attempt to write a parquet file using lz4, zstd, or brotli compression 
> using R arrow 0.15.0, I am unable to do so due to the codec support not being 
> built (example below).
>  
> {code:java}
> > arrow::write_parquet(payout_strategy, sink = 
> > "records_test_lz4.parquet",compression = "lz4")
> Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
>  Arrow error: IOError: Arrow error: NotImplemented: LZ4 codec support not 
> built{code}
>  
> I believe that the error is generated through 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression.cc#L124-L145],
>  but I am not sure how to call 
> {code:java}
> install.packages("arrow"){code}
> in R to enable the ARROW_WITH_ZSTD/LZ4/BROTLI flags, or whether I should be 
> installing zstd separately from arrow and then doing something pre- or 
> post-install to link zstd with arrow. From 
> [https://github.com/apache/arrow/issues/1209], it appears that zstd support 
> has been added to arrow and parquet in general, and the R package readme 
> ([https://github.com/apache/arrow/tree/master/r]) 
>  notes "On macOS and Windows, installing a binary package from CRAN will 
> handle Arrow's C++ dependencies for you", but I get the sense that does not 
> apply to zstd.
>  
> Is there guidance as to how to enable zstd and other compression codecs prior 
> to or after downloading the R arrow package? Could this be added to the R 
> documentation somewhere for future reference?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2019-11-18 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976844#comment-16976844
 ] 

Bryan Cutler commented on ARROW-4890:
-

Sorry, I'm not aware of any documentation of the limits. It would be great to 
get that written down somewhere, and there should be a better error message for 
this, but maybe that should be done on the Spark side.

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>Reporter: Abdeali Kothari
>Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in Arrow project as the traceback seems to suggest this is an 
> issue in Arrow.
>  Continuation from the conversation on the 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in 
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as the size of the dataset I want to group on starts increasing. Here is a 
> code snippet with which I can reproduce this.
>  Note: My actual dataset is much larger and has many more unique IDs; it is a 
> valid use case where I cannot simplify this groupby in any way. I have 
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>   columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
>   columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
>     return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per 
> executor too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [C#] Large record batch is written with negative buffer length

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7156:
---
Summary: [C#] Large record batch is written with negative buffer length  
(was: [R] [C++] Large Batches Cause Error / Crashes)

> [C#] Large record batch is written with negative buffer length
> --
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, C++
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!{code}
>  
> Update
> I put the data in the batch into a separate file. The file size is over 2 
> GB. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [C#] Large record batch is written with negative buffer length

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7156:
---
Component/s: (was: C++)

> [C#] Large record batch is written with negative buffer length
> --
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!{code}
>  
> Update
> I put the data in the batch into a separate file. The file size is over 2 
> GB. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7156:
---
Component/s: (was: R)
 C#

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, C++
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!{code}
>  
> Update
> I put the data in the batch into a separate file. The file size is over 2 
> GB. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-18 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976841#comment-16976841
 ] 

Ben Kietzman commented on ARROW-7156:
-

I can reproduce this failure with
{code}
arrow-file-to-stream SingleBatch_String_85000_Rows.arrow > /dev/null
{code}

I can confirm that the buffer length is negative as it is read from the 
Flatbuffers metadata: 
https://github.com/apache/arrow/blob/bef9a1c/cpp/src/arrow/ipc/message.cc#L159

The C# writer does seem to be producing an invalid file.
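
For reference, a Python-level sketch of the same check (the file name is the 
attachment on this ticket; reading it should surface the same error):
{code:java}
import pyarrow as pa

# Open the attached single-batch file and try to read each batch.
with pa.memory_map("SingleBatch_String_85000_Rows.arrow") as source:
    reader = pa.ipc.open_file(source)
    for i in range(reader.num_record_batches):
        try:
            reader.get_batch(i)
        except (pa.ArrowInvalid, pa.ArrowIOError) as e:
            print(i, e)  # expect the "negative malloc size" style failure
{code}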

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!{code}
>  
> Update
> I put the data in the batch into a separate file. The file size is over 2 
> GB. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7158) [C++][Visual Studio]Build config Error on non English Version visual studio.

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7158.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5853
[https://github.com/apache/arrow/pull/5853]

> [C++][Visual Studio]Build config Error on non English Version visual studio.
> 
>
> Key: ARROW-7158
> URL: https://issues.apache.org/jira/browse/ARROW-7158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Yiun Seungryong
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> * Build config error on non-English versions of the OS
>  * It always shows 
> {code:java}
>  Not supported MSVC compiler  {code}
>  * 
> [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44]
> There is a bug in the code below.
> {code:java}
> if(MSVC)
>  set(COMPILER_FAMILY "msvc")
>  if("${COMPILER_VERSION_FULL}" MATCHES
>  ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code}
>  * In my compiler the version display contains Korean.
> {code:java}
> Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code}
>  * The regular expression seems to need to be changed.
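
The locale dependence is easy to demonstrate outside CMake; a sketch in Python 
(illustration only; the actual fix belongs in CompilerInfo.cmake):
{code:java}
import re

banners = [
    "Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64",
    "Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64)",  # Korean banner from this report
]

# Matching only the locale-independent tokens works for both banners.
pattern = re.compile(r"Microsoft ?\(R\) C/C\+\+.* 19\..*x64")
for banner in banners:
    print(bool(pattern.search(banner)))  # True, True
{code}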



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7178) [C++] Vendor forward compatible std::optional

2019-11-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-7178:
-

Assignee: Gawain BOLTON

> [C++] Vendor forward compatible std::optional
> -
>
> Key: ARROW-7178
> URL: https://issues.apache.org/jira/browse/ARROW-7178
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Having std::optional was mentioned a few times; [~emkornfi...@gmail.com] 
> suggested https://github.com/martinmoene/optional-lite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7178) [C++] Vendor forward compatible std::optional

2019-11-18 Thread Gawain BOLTON (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976815#comment-16976815
 ] 

Gawain BOLTON commented on ARROW-7178:
--

I would have liked to assign myself this ticket, but it seems I do not have 
permission to do so.

> [C++] Vendor forward compatible std::optional
> -
>
> Key: ARROW-7178
> URL: https://issues.apache.org/jira/browse/ARROW-7178
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Having std::optional was mentioned a few times; [~emkornfi...@gmail.com] 
> suggested https://github.com/martinmoene/optional-lite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7203) [CI] Trigger GitHub Action cron workflows on demand

2019-11-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7203:
--

 Summary: [CI] Trigger GitHub Action cron workflows on demand
 Key: ARROW-7203
 URL: https://issues.apache.org/jira/browse/ARROW-7203
 Project: Apache Arrow
  Issue Type: Wish
  Components: Continuous Integration, Developer Tools
Reporter: Neal Richardson


The new GitHub Actions workflows in place of Travis-CI are great. So much 
faster feedback than before. One question we've had since migrating is how 
Crossbow and the GHA jobs should interact. GHA makes it easy enough to schedule 
nightly jobs, so we don't really need the Crossbow trickery to get the nightly 
builds to run. However, we don't yet have a solution for the other way we use 
Crossbow: triggering builds via PR comments (using ursabot). 

As it turns out, last week I was at the GitHub Universe conference and spoke to 
some of the lead devs about our experience and needs. I asked especially about 
this because the day before, I had debugged and fixed a cron workflow 
(https://issues.apache.org/jira/browse/ARROW-7164) and struggled with how to 
test that I had fixed it. (Interestingly, I happened to talk to the dev who was 
responsible for the action code change that caused our job to start failing.)

Triggering cron workflows on demand was not a feature they (or at least he) had 
considered. We brainstormed a few ways we might be able to do it. None of them 
were simple or clean. Here's what we discussed. See also the 
[docs|https://help.github.com/en/actions/automating-your-workflow-with-github-actions/events-that-trigger-workflows]
 for events that trigger workflows.

Ideally, we could use the {{issue_comment}} event to trigger a workflow, just 
as we do now via ursabot and crossbow. But there are some challenges:

* We can trigger based on an issue_comment being created, but then we'd have to 
write some code to parse the payload of the event (in the {{github.event}} 
object) and then do things only if the right build name is found. This would 
mean that for every cron workflow we have, this logic would run every time 
anyone makes any comment on any issue or PR. That might have some undesirable 
side effects.
* The issue_comment event takes the workflow script from the master branch. So 
if you're testing changes to the workflow yaml itself, you're out of luck.
* You could work around this by also conditionally running the workflow if 
the workflow file itself is changed. 
* Because the event triggers from master, you need to modify the checkout step 
to check out a commit/ref. Unfortunately, the commit SHA isn't present in the 
[event 
payload|https://developer.github.com/v3/activity/events/types/#issuecommentevent].
 You could probably parse it out of one of the URLs in the payload, though. 
Assuming you do, then you probably need to make this behavior conditional on 
whether you're running from an issue_comment (in which case you parse the event 
and go with master) or from cron or some other means (in which case you don't 
want to specify a ref).
* Think of all of the complexity here, and realize that unless there's some way 
to package this up into an action or template or something, we have to 
replicate this for every workflow we want to be able to trigger like this.

An alternative strategy would be to use the existing ursabot integration and 
trigger GitHub workflows from it using a [repository 
dispatch|https://developer.github.com/v3/repos/#create-a-repository-dispatch-event]
 event. The repository dispatch event would have an event_type that (somehow?) 
we would map to the workflow, and then in the client_payload we could include 
any additional build params. This would have to include the PR number or commit 
SHA because, just as with issue_comments, the workflow will run from master so 
we'll need to explicitly checkout something else. The other limitation is that 
repository dispatch requires an API token with repo write access; fortunately, 
ursabot already has had to deal with this.
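
A sketch of what that could look like from ursabot's side (the event_type, 
payload fields, and token handling here are assumptions for illustration):
{code:java}
import json, os, urllib.request

req = urllib.request.Request(
    "https://api.github.com/repos/apache/arrow/dispatches",
    data=json.dumps({
        "event_type": "trigger-nightly",  # hypothetical name mapped to a workflow
        "client_payload": {"pr": 1234, "sha": "deadbeef"},  # hypothetical build params
    }).encode(),
    headers={
        # Repository dispatch requires a token with repo write access.
        "Authorization": "token " + os.environ["GITHUB_TOKEN"],
        "Accept": "application/vnd.github.everest-preview+json",
    },
    method="POST",
)
urllib.request.urlopen(req)
{code}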

A further ursabot-centric approach would extend crossbow to be able to create 
GHA workflows; triggering a workflow on demand via crossbow (through the 
ursabot comment bot or otherwise) would essentially copy the workflow to its 
repository, amending the workflow to check out the repo and commit and to run 
on push.

In sum, it's not straightforward to do this, and as it stands now, there's a 
bit of code to write somewhere for this. 

cc [~kou] [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-18 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976738#comment-16976738
 ] 

Ben Kietzman commented on ARROW-7156:
-

If it's suspected that the C# writer is emitting invalid record batches, could 
you share the code which generates your test files?

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!{code}
>  
> Update
> I put the data in the batch into a separate file. The file size is over 2 
> GB. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-18 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976738#comment-16976738
 ] 

Ben Kietzman edited comment on ARROW-7156 at 11/18/19 5:57 PM:
---

[~abbot] If it's suspected that the C# writer is emitting invalid record 
batches, could you share the code which generates your test files?


was (Author: bkietz):
If it's suspected that the C# writer is emitting invalid record batches, could 
you share the code which generates your test files?

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!{code}
>  
> Update
> I put the data in the batch into a separate file. The file size is over 2 
> GB. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7202) [R][CI] Improve rwinlib building on CI to stop re-downloading dependencies

2019-11-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7202:
--

 Summary: [R][CI] Improve rwinlib building on CI to stop 
re-downloading dependencies
 Key: ARROW-7202
 URL: https://issues.apache.org/jira/browse/ARROW-7202
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Remove the {{--rmdeps}} arg to makepkg-mingw so that the dependencies don't get 
cleaned up. Then you can copy the relevant ones from {{/mingw64/lib}} / {{/mingw32/lib}} 
(see 
https://ci.appveyor.com/project/nealrichardson/arrow/builds/28875888?fullLog=true#L1134)
 when building the package in the followup script, rather than wget them all 
again later (and having to worry about version numbers, repositories, etc.). 

Note that relatedly, if we build libs during the arrow build (like uriparser), 
we can copy the .a files that get built from the build directory (which also 
gets deleted) to {{${MINGW_PREFIX}/lib}} so that we can package them up later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7188) [C++][Doc] doxygen broken on master: missing param implicit_casts

2019-11-18 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976727#comment-16976727
 ] 

Neal Richardson commented on ARROW-7188:


Fixing in ARROW-7148

> [C++][Doc] doxygen broken on master: missing param implicit_casts
> -
>
> Key: ARROW-7188
> URL: https://issues.apache.org/jira/browse/ARROW-7188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset, Documentation
>Reporter: Neal Richardson
>Assignee: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://circleci.com/gh/ursa-labs/crossbow/4991



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7188) [C++][Doc] doxygen broken on master: missing param implicit_casts

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7188:
--

Assignee: Neal Richardson  (was: Ben Kietzman)

> [C++][Doc] doxygen broken on master: missing param implicit_casts
> -
>
> Key: ARROW-7188
> URL: https://issues.apache.org/jira/browse/ARROW-7188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset, Documentation
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://circleci.com/gh/ursa-labs/crossbow/4991



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7094) [C++] FileSystemDataSource should use an owning pointer for fs::Filesystem

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7094:
--

Assignee: Francois Saint-Jacques  (was: Neal Richardson)

> [C++] FileSystemDataSource should use an owning pointer for fs::Filesystem
> --
>
> Key: ARROW-7094
> URL: https://issues.apache.org/jira/browse/ARROW-7094
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-6340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7188) [C++][Doc] doxygen broken on master: missing param implicit_casts

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7188:
--

Assignee: Francois Saint-Jacques  (was: Neal Richardson)

> [C++][Doc] doxygen broken on master: missing param implicit_casts
> -
>
> Key: ARROW-7188
> URL: https://issues.apache.org/jira/browse/ARROW-7188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset, Documentation
>Reporter: Neal Richardson
>Assignee: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://circleci.com/gh/ursa-labs/crossbow/4991



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7147) [C++][Dataset] Refactor dataset's API to use Result

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7147:
--

Assignee: Neal Richardson  (was: Francois Saint-Jacques)

> [C++][Dataset] Refactor dataset's API to use Result
> --
>
> Key: ARROW-7147
> URL: https://issues.apache.org/jira/browse/ARROW-7147
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> We should make this switch before the API settles



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7148) [C++][Dataset] API cleanup

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7148:
--

Assignee: Francois Saint-Jacques  (was: Neal Richardson)

> [C++][Dataset] API cleanup
> --
>
> Key: ARROW-7148
> URL: https://issues.apache.org/jira/browse/ARROW-7148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7147) [C++][Dataset] Refactor dataset's API to use Result

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7147:
--

Assignee: Francois Saint-Jacques  (was: Neal Richardson)

> [C++][Dataset] Refactor dataset's API to use Result
> --
>
> Key: ARROW-7147
> URL: https://issues.apache.org/jira/browse/ARROW-7147
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> We should make this switch before the API settles



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7148) [C++][Dataset] API cleanup

2019-11-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7148:
--

Assignee: Neal Richardson

> [C++][Dataset] API cleanup
> --
>
> Key: ARROW-7148
> URL: https://issues.apache.org/jira/browse/ARROW-7148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2019-11-18 Thread albertoramon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976721#comment-16976721
 ] 

albertoramon edited comment on ARROW-785 at 11/18/19 5:32 PM:
--

I saw this (SparkSQL 2.4.4, PyArrow 0.15):

The problem is creating the table with INT columns (BIGINT works properly).

Solution: changing INT to BIGINT in the CREATE TABLE works fine (I tried to use 
DOUBLE but that didn't work).

In my case these Parquet files are from the SSB benchmark:
{code:java}
SELECT MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY)
FROM SSB.LINEORDER;
Returns: 2 20 2000
{code}

In my Column_Types I had (thus I need to review my Python code :)):
{code:java}
'lo_custkey':'int64',
 'lo_partkey':'int64',
 'lo_suppkey':'int64',{code}

was (Author: albertoramon):
I saw this (SparkSQL 2.4.4, PyArrow 0.15):

The problem is creating the table with INT columns (BIGINT works properly).

Solution: changing INT to BIGINT in the CREATE TABLE works fine (I tried to use 
DOUBLE but that didn't work).

In my case these Parquet files are from the SSB benchmark:
{code:java}
SELECT MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY)
FROM SSB.LINEORDER;
Returns: 2 20 2000
{code}

In my Column_Types I had, thus I need to review my Python code :) :
{code:java}
'lo_custkey':'int64',
 'lo_partkey':'int64',
 'lo_suppkey':'int64',{code}

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jeff Reback
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.5.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> OP stats that it is not readable in Hive however.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2019-11-18 Thread albertoramon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976721#comment-16976721
 ] 

albertoramon commented on ARROW-785:


I saw this (SparkSQL 2.4.4, PyArrow 0.15):

The problem is creating the table with INT columns (BIGINT works properly).

Solution: changing INT to BIGINT in the CREATE TABLE works fine (I tried to use 
DOUBLE but that didn't work).

In my case these Parquet files are from the SSB benchmark:
{code:java}
SELECT MAX(LO_CUSTKEY), MAX(LO_PARTKEY), MAX(LO_SUPPKEY)
FROM SSB.LINEORDER;
Returns: 2 20 2000
{code}

In my Column_Types I had, thus I need to review my Python code :) :
{code:java}
'lo_custkey':'int64',
 'lo_partkey':'int64',
 'lo_suppkey':'int64',{code}
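
A sketch of the dtype side of this (assumption: the pandas dtype drives the 
Parquet physical type): int64 columns are written as Parquet INT64, which Hive 
reads as BIGINT; to keep a Hive INT column, downcast to int32 before writing.
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'lo_custkey': pd.Series([1, 2, 3], dtype='int64')})
df['lo_custkey'] = df['lo_custkey'].astype('int32')  # matches Hive INT

pq.write_table(pa.Table.from_pandas(df), 'lineorder.parquet')
{code}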

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jeff Reback
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.5.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> OP stats that it is not readable in Hive however.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-1900) [C++] Add kernel functions for determining value range (maximum and minimum) of integer arrays

2019-11-18 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-1900.
-
Resolution: Fixed

Issue resolved by pull request 5697
[https://github.com/apache/arrow/pull/5697]

> [C++] Add kernel functions for determining value range (maximum and minimum) 
> of integer arrays
> --
>
> Key: ARROW-1900
> URL: https://issues.apache.org/jira/browse/ARROW-1900
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> These functions can be useful internally for determining when a "small range" 
> alternative to a hash table can be used for integer arrays. The maximum and 
> minimum are determined in a single scan.
> We already have infrastructure for aggregate kernels, so this would be an 
> easy addition.
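
The single-scan idea in plain Python, for illustration (the threshold is 
hypothetical; the real kernel would live in the C++ aggregate infrastructure):
{code:java}
def min_max(values):
    # One pass: track both extremes together instead of scanning twice.
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

lo, hi = min_max([7, 3, 9, 4])
use_small_range = (hi - lo) < (1 << 16)  # hypothetical cutoff for a direct-index table
{code}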



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7201) [GLib][Gandiva] Add support for BooleanNode

2019-11-18 Thread Yosuke Shiro (Jira)
Yosuke Shiro created ARROW-7201:
---

 Summary: [GLib][Gandiva] Add support for BooleanNode
 Key: ARROW-7201
 URL: https://issues.apache.org/jira/browse/ARROW-7201
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7195) [Ruby] Improve #filter, #take, and #is_in

2019-11-18 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-7195.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5851
[https://github.com/apache/arrow/pull/5851]

> [Ruby] Improve #filter, #take, and #is_in
> -
>
> Key: ARROW-7195
> URL: https://issues.apache.org/jira/browse/ARROW-7195
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is follow-up of 
> https://github.com/apache/arrow/pull/5837#pullrequestreview-317438583.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7200) Running Arrow Flight benchmark on two hosts doesn't work

2019-11-18 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7200:
--

 Summary: Running Arrow Flight benchmark on two hosts doesn't work
 Key: ARROW-7200
 URL: https://issues.apache.org/jira/browse/ARROW-7200
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, C++, FlightRPC
Affects Versions: 0.15.1, 0.15.0
 Environment: AWS EC2
Instance type: t3a.xlarge
AMI: ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20191002
Number of instances: 2
They are capable of pinging each other.
Reporter: Chengxin Ma
 Attachments: Screen Shot 2019-11-18 at 16.00.38.png

I was trying to evaluate the performance of Apache Arrow Flight on two hosts 
(one as the client and the other one as the server), using [the official 
benchmark|https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc].

Flags I used to build the project were:

 
{code:java}
-DARROW_FLIGHT=ON
-DCMAKE_BUILD_TYPE=Debug
-DARROW_BUILD_BENCHMARKS=ON
{code}
 

The branch I used was maint-0.15.x since there was a build error on the master 
branch. _(The build error on master only existed in the environment where I set 
up two hosts: AWS. On my local environment (macOS) the build was successful on 
the master branch. I don't think this build error is relevant to the issue 
since there is no difference in the cpp source code.)_

On the host acting as the server, I ran 
{code:java}
./arrow-flight-perf-server{code}
On the host acting as the client, I ran 
{code:java}
./arrow-flight-benchmark --server_host ip-172-31-11-18{code}
It gives the following error: 
{code:java}
Failed with error: << IOError: gRPC returned unavailable error, with message: 
Connect Failed. Detail: Unavailable{code}
 

 If I run 
{code:java}
./arrow-flight-benchmark --server_host ip-172-31-11-17{code}
the error is different:
{code:java}
IOError: Server was not available after 10 attempts{code}
This is understandable since that host doesn't exist at all.

This indicates that Flight is able to find the existing host (ip-172-31-11-18), 
but the communication somehow didn't succeed.

The benchmark works fine if I run it with the localhost, either by not 
specifying the server_host flag or running the server in another process on the 
same host.

I am not sure if the problem is in the environment or in the code itself. Could 
someone please give me a hint on how to resolve the problem?
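
One quick way to separate environment from code (a sketch; the port number is 
an assumption, arrow-flight-perf-server reports the port it listens on at 
startup):
{code:java}
import socket

# If a raw TCP connect fails, the gRPC "Connect Failed" error is an
# environment problem (e.g. AWS security group rules), not a Flight bug.
sock = socket.create_connection(("ip-172-31-11-18", 31337), timeout=5)
print("TCP connect OK")
sock.close()
{code}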



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7178) [C++] Vendor forward compatible std::optional

2019-11-18 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-7178.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5849
[https://github.com/apache/arrow/pull/5849]

> [C++] Vendor forward compatible std::optional
> -
>
> Key: ARROW-7178
> URL: https://issues.apache.org/jira/browse/ARROW-7178
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Having std::optional was mentioned a few times; [~emkornfi...@gmail.com] 
> suggested https://github.com/martinmoene/optional-lite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7148) [C++][Dataset] API cleanup

2019-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7148:
--
Labels: pull-request-available  (was: )

> [C++][Dataset] API cleanup
> --
>
> Key: ARROW-7148
> URL: https://issues.apache.org/jira/browse/ARROW-7148
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7197) [Ruby] Suppress keyword argument related warnings with Ruby 2.7

2019-11-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7197.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5855
[https://github.com/apache/arrow/pull/5855]

> [Ruby] Suppress keyword argument related warnings with Ruby 2.7
> ---
>
> Key: ARROW-7197
> URL: https://issues.apache.org/jira/browse/ARROW-7197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-7199:
---
Component/s: Java

> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> error stacktrace:
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}
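
The trace above suggests that getChildAllocators() copies the child-allocator 
map into a new HashSet while another thread is adding or removing children. 
This is not the actual Java patch, but an analogous fix sketch in Python, 
assuming a snapshot taken under the allocator's lock is acceptable:
{code:python}
import threading

class Allocator:
    """Toy stand-in for BaseAllocator; all names here are illustrative."""

    def __init__(self):
        self._lock = threading.Lock()
        self._children = {}  # child allocator -> bookkeeping record

    def add_child(self, child):
        with self._lock:
            self._children[child] = object()

    def get_child_allocators(self):
        # Snapshot under the lock: copying the keys while another thread
        # mutates the map is exactly the race that surfaces as
        # ConcurrentModificationException in the Java allocator.
        with self._lock:
            return set(self._children)
{code}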



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7199.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5856
[https://github.com/apache/arrow/pull/5856]

> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> error stacktrace:
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-7199:

Description: 
error stacktrace:

{{(java.util.ConcurrentModificationException) null}}
{{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
{{java.util.IdentityHashMap$KeyIterator.next():825}}
{{java.util.AbstractCollection.addAll():343}}
{{java.util.HashSet.<init>():119}}
{{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}

  was: the same stack trace wrapped in JIRA {color} markup


> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> error stacktrace:
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-7199:

Description: 
error stacktrace:

{{(java.util.ConcurrentModificationException) null}}
{{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
{{java.util.IdentityHashMap$KeyIterator.next():825}}
{{java.util.AbstractCollection.addAll():343}}
{{java.util.HashSet.<init>():119}}
{{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}

  was: the same stack trace with a stray {{}} marker before it


> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> error stacktrace:
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-7199:

Description: 
{{(java.util.ConcurrentModificationException) null}}
{{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
{{java.util.IdentityHashMap$KeyIterator.next():825}}
{{java.util.AbstractCollection.addAll():343}}
{{java.util.HashSet.<init>():119}}
{{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}

  (was: the same stack trace in slightly different JIRA {color} markup)

> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-7199:

Description: 
{{(java.util.ConcurrentModificationException) null}}
{{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
{{java.util.IdentityHashMap$KeyIterator.next():825}}
{{java.util.AbstractCollection.addAll():343}}
{{java.util.HashSet.<init>():119}}
{{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}

  (was: the same stack trace in slightly different JIRA {color} markup)

> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-7199:

Description: 
{{(java.util.ConcurrentModificationException) null}}
{{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
{{java.util.IdentityHashMap$KeyIterator.next():825}}
{{java.util.AbstractCollection.addAll():343}}
{{java.util.HashSet.<init>():119}}
{{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}

> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{(java.util.ConcurrentModificationException) null}}
> {{java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex():734}}
> {{java.util.IdentityHashMap$KeyIterator.next():825}}
> {{java.util.AbstractCollection.addAll():343}}
> {{java.util.HashSet.<init>():119}}
> {{org.apache.arrow.memory.BaseAllocator.getChildAllocators():128}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7199:
--
Labels: pull-request-available  (was: )

> [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators
> ---
>
> Key: ARROW-7199
> URL: https://issues.apache.org/jira/browse/ARROW-7199
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Critical
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-18 Thread Bogdan Klichuk (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976379#comment-16976379
 ] 

Bogdan Klichuk commented on ARROW-7150:
---

Yeah, it's not that simple on my end. It's just one of the JSON datasets 
inside the table.

I need to save any kind of JSON/CSV without any type inference: all kinds of 
CSV/JSON, provided by users, must be saved as-is, without parsing, for logging 
purposes and as evidence of the exact original data. So I will end up dealing 
with big strings, up to 500MB each.

> [Python] Explain parquet file size growth
> -
>
> Key: ARROW-7150
> URL: https://issues.apache.org/jira/browse/ARROW-7150
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Mac OS X
>Reporter: Bogdan Klichuk
>Priority: Major
> Attachments: 820.parquet
>
>
> Having the columnar storage format in mind, with gzip compression enabled, I 
> can't make sense of how the parquet file size grows in my specific example 
> (so far without sharing a dataset; I would need to create a mock one to share).
> {code:java}
> > # 1. read 820 rows from a parquet file
> > df = pandas.read_parquet('820.parquet')
> > # size of 820.parquet is 528K
> > len(df)
> 820
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 33M. Why is it 60 times bigger?
>  {code}
> Compression works better on bigger files, so how come a 10x increase with 
> repeated data resulted in 60x growth of the file? Insane imo.
>  
> I am working on a periodic job that concats smaller files into bigger ones, 
> and I now doubt whether I need it.
>  
> I attached 820.parquet to try out.
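
One way to see where the growth comes from is to compare per-column compressed 
sizes in the two files' metadata; if dictionary encoding stops paying off for 
the repeated strings, it shows up there. A hedged sketch, assuming pyarrow is 
installed and the two files from the quoted snippet exist:
{code:python}
import pyarrow.parquet as pq

def total_compressed_size(path):
    """Sum compressed byte counts over all row groups and columns."""
    meta = pq.ParquetFile(path).metadata
    total = 0
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            total += row_group.column(col).total_compressed_size
    return total

# '820.parquet' and '8200.parquet' are the files from the snippet above.
for path in ("820.parquet", "8200.parquet"):
    print(path, total_compressed_size(path), "compressed bytes")
{code}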



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7199) [Java] ConcurrentModificationException in BaseAllocator::getChildAllocators

2019-11-18 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-7199:
---

 Summary: [Java] ConcurrentModificationException in 
BaseAllocator::getChildAllocators
 Key: ARROW-7199
 URL: https://issues.apache.org/jira/browse/ARROW-7199
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7042) Discuss for ARM supporting and ARM CI

2019-11-18 Thread zhao bo (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976340#comment-16976340
 ] 

zhao bo commented on ARROW-7042:


Hi [~uwe],

It would be great if the Arrow team could provide the R and Python packages 
for ARM. We found no way to install Arrow for ARM directly from the Arrow 
docs or from blog posts. For example, we use Ubuntu 18.04, but the 
installation doc only gives us an x86 apt repo. Could you / the Arrow team 
please help build the packages for ARM? Thank you very much.

> Discuss for ARM supporting and ARM CI
> -
>
> Key: ARROW-7042
> URL: https://issues.apache.org/jira/browse/ARROW-7042
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI
>Reporter: zhao bo
>Priority: Major
>
> Hi, team.
> I'm ZhaoBo from Openlab, working on making more open-source projects run on 
> ARM. I want to ask some questions in the ARROW community. Thanks.
> 1. Is there any plan in the community for supporting an ARM release?
> 2. I found 
> [https://github.com/apache/arrow/pull/3010#issuecomment-441091047] that 
> [~guyuqi] proposed; what is the status of the ARM support? Could you please 
> tell me some details?
> 3. I found that the current CI system doesn't integrate any ARM tests. If 
> the community wants to make tests on ARM work, what's the plan? Using 
> Travis CI for ARM support, or other ways?
>  
> Also, it would be great if [~guyuqi] could tell me the status and the plan 
> going forward. ;) This issue will trace the whole workflow for ARM. Thank 
> you.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)