Aligning intended target types for lists and structs when converting to pandas DataFrame

2020-09-10 Thread Tim Swast
I'd like to figure out what the next steps should be. Options: - Get BigQuery to output the currently expected Python objects in Ibis, - Change Ibis to expect more Arrow-aligned types for complex types, or - Update the Ibis tests to accept either Python objects or the output of Arrow
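For context, a minimal sketch (not from the thread itself) of the pyarrow behavior under discussion: by default, to_pandas() turns list cells into numpy arrays and struct cells into Python dicts, which is the "Arrow-aligned" output the tests would need to accept. Assumes pyarrow with pandas installed.

    import pyarrow as pa

    # Sketch of pyarrow's default complex-type conversion: each list cell
    # becomes a numpy array and each struct cell becomes a Python dict.
    table = pa.table({
        "tags": pa.array([["a", "b"], ["c"]]),                    # list<string>
        "point": pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}]),  # struct<x, y>
    })

    df = table.to_pandas()
    print(type(df["tags"][0]))   # <class 'numpy.ndarray'>
    print(type(df["point"][0]))  # <class 'dict'>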

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Tim Swast
FWIW, I found the Column class to be confusing in Python. It felt redundant / unneeded to actually create Tables. On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney wrote: > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou wrote: > > On 2019-07-08 at 23:17, Wes McKinney wrote: > > > I'm
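For readers landing here later: the Column wrapper was eventually removed, and a table column is now exposed directly as a ChunkedArray, with the name kept on the schema. A minimal sketch of that shape, assuming a reasonably recent pyarrow:

    import pyarrow as pa

    # Post-Column API sketch: no wrapper object between Table and the data.
    table = pa.table({"a": [1, 2, 3]})
    col = table.column("a")          # a ChunkedArray, not a Column
    print(type(col))
    print(table.schema.field("a"))   # the name and type live on the schema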

Re: Avro to Arrow?

2019-06-12 Thread Tim Swast
> Let me know if you want to collaborate on it. Thanks, Micah. What are your thoughts on reading schemaless Avro bytes? One of the reasons I have started experimenting with the fork is that fastavro had trouble reading more than one row at a time from a schemaless reader.
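To illustrate the schemaless pain point: fastavro's schemaless_reader decodes exactly one record per call, so consuming a stream of concatenated schemaless rows means looping manually. A rough sketch (the Row schema here is made up for illustration):

    import io
    from fastavro import schemaless_reader, schemaless_writer

    schema = {
        "type": "record",
        "name": "Row",
        "fields": [{"name": "x", "type": "long"}],
    }

    # Write two schemaless (no embedded schema) records back to back.
    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"x": 1})
    schemaless_writer(buf, schema, {"x": 2})

    # schemaless_reader returns one record per call, so reading a
    # concatenated stream means looping until the buffer is exhausted.
    buf.seek(0)
    end = len(buf.getbuffer())
    while buf.tell() < end:
        print(schemaless_reader(buf, schema))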

Re: Avro to Arrow?

2019-06-11 Thread Tim Swast
provide feedback should you choose to go down this route. > Thanks, Wes. On Tue, Jun 11, 2019, 4:53 PM Tim Swast wrote: > > Hi Arrow and Avro devs, I've been investigating some performance issues with the BigQuery Storage API

Avro to Arrow?

2019-06-11 Thread Tim Swast
"no" (as I suspect it is) and I don't contribute it now, the package will be clearly identified as a fork of the Apache Avro project and licensed Apache 2.0, so it should be easy to pull in once the techniques are proven. -- Tim Swast, Software Friendliness Engineer, Google Cloud Developer Relations, Seattle, WA, USA

[jira] [Created] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-05-30 Thread Tim Swast (JIRA)
Tim Swast created ARROW-5450: Summary: [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long Key: ARROW-5450 URL: https://issues.apache.org/jira/browse/ARROW-5450
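A hypothetical reproduction sketch, assuming the overflow comes from timestamp values outside the range datetime.datetime can represent; the exact trigger is in the linked ticket. On affected versions, the conversion surfaced as a raw OverflowError.

    import pyarrow as pa

    # Hypothetical repro: 2**62 microseconds since the epoch is far beyond
    # datetime.datetime's year-9999 ceiling.
    arr = pa.array([2**62], type=pa.timestamp("us"))
    try:
        print(arr.to_pylist())
    except (OverflowError, ValueError) as exc:
        # Fixed versions may raise a clearer error instead.
        print(type(exc).__name__, exc)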

[Python] Is there a way to specify a column as non-nullable with parquet.write_table?

2019-05-23 Thread Tim Swast
mark fields as required / non-nullable in parquet files? If there is, is there a way to set that option with pyarrow.parquet.write_table? https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
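A sketch of the approach that should work, assuming write_table preserves Arrow schema nullability (which is exactly what this thread is asking): mark the field non-nullable on the Arrow schema before writing, since write_table itself has no such option.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Nullability rides on the Arrow schema, not on write_table arguments.
    schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])
    table = pa.table({"id": [1, 2, 3]}, schema=schema)
    pq.write_table(table, "example.parquet")

    # If the writer preserves it, the field reads back as required.
    print(pq.read_schema("example.parquet"))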

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-09 Thread Tim Swast
        pickle_deserialize = timeit.timeit(lambda: pickle.loads(serialized_obj),
                                           number=number)
        pickle_results.append(pickle_deserialize)
        serialized_obj = serialize_by_arrow_array(obj_batch)
        arrow_deserialize = timeit.timeit(lambda: pa.deserialize(serialized_obj),
                                          number=number)
        arrow_results.append(arrow_deserialize)
    return [pickle_results, arrow_results]

def serialize_by_arrow_array(obj_batch):
    arrow_arrays = [pa.array(record) if not isinstance(record, pa.Array)
                    else record for record in obj_batch]
    return pa.serialize(arrow_arrays).to_buffer()

plot_dir = '{}/{}'.format(dir_path, datetime.datetime.now().strftime('%m%d_%H%M_%S'))
if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

def plot_time(pickle_times, arrow_times, batch_sizes, title, filename):
    fig, ax = plt.subplots()
    fig.set_size_inches(10, 8)
    bar_width = 0.35
    n_groups = len(batch_sizes)
    index = np.arange(n_groups)
    opacity = 0.6
    plt.bar(index, pickle_times, bar_width, alpha=opacity, color='r', label='Pickle')
    plt.bar(index + bar_width, arrow_times, bar_width, alpha=opacity, color='c', label='Arrow')
    plt.title(title, fontweight='bold')
    plt.ylabel('Time (seconds)', fontsize=10)
    plt.xticks(index + bar_width / 2, batch_sizes, fontsize=10)
    plt.legend(fontsize=10, bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.yticks(fontsize=10)
    plt.savefig(plot_dir + '/plot-' + filename + '.png', format='png')

def plot_size(pickle_sizes, arrow_sizes, batch_sizes, title, filename):
    fig, ax = plt.subplots()
    fig.set_size_inches(10, 8)
    bar_width = 0.35
    n_groups = len(batch_sizes)
    index = np.arange(n_groups)
    opacity = 0.6
    plt.bar(index, pickle_sizes, bar_width, alpha=opacity, color='r', label='Pickle')
    plt.bar(index + bar_width, arrow_sizes, bar_width, alpha=opacity, color='c', label='Arrow')
    plt.title(title, fontweight='bold')
    plt.ylabel('Space (Byte)', fontsize=10)
    plt.xticks(index + bar_width / 2, batch_sizes, fontsize=10)
    plt.legend(fontsize=10, bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.yticks(fontsize=10)
    plt.savefig(plot_dir + '/plot-' + filename + '.png', format='png')

def get_union_obj():
    size = 200
    str_array = pa.array(['str-' + str(i) for i in range(size)])
    int_array = pa.array(np.random.randn(size).tolist())
    types = pa.array([0 for _ in range(size)] + [1 for _ in range(size)], type=pa.int8())
    offsets = pa.array(list(range(size)) + list(range(size)), type=pa.int32())
    union_arr = pa.UnionArray.from_dense(types, offsets, [str_array, int_array])
    return union_arr

test_objects_generater = [
    lambda: np.random.randn(500),
    lambda: np.random.randn(500).tolist(),
    lambda: get_union_obj()
]

titles = [
    'numpy arrays',
    'list of ints',
    'union array of strings and ints'
]

def plot_benchmark():
    batch_sizes = list(OrderedDict.fromkeys(int(i) for i in np.geomspace(1, 1000, num=25)))
    for i in range(len(test_objects_generater)):
        batches = [[test_objects_generater[i]() for _ in range(batch_size)]
                   for batch_size in batch_sizes]
        ser_result = benchmark_ser(batches=batches)
        plot_time(*ser_result[0:2], batch_sizes, 'serialization: ' + titles[i], 'ser_time' + str(i))
        plot_size(*ser_result[2:], batch_sizes, 'serialization byte size: ' + titles[i], 'ser_size' + str(i))
        deser = benchmark_deser(batches=batches)
        plot_time(*deser, batch_sizes, 'deserialization: ' + titles[i], 'deser_time-' + str(i))

if __name__ == "__main__":
    plot_benchmark()

Question

So if I want to use Arrow as a data serialization framework in distributed stream data processing, what's the right way? Since stream processing is a widespread scenario in data processing, frameworks such as Flink and Spark Structured Streaming are becoming more and more popular. Is there a possibility to add special support for stream processing in Arrow, so that we can also benefit from its cross-language support and efficient memory layout?

-- Tim Swast, Software Friendliness Engineer, Google Cloud Developer Relations, Seattle, WA, USA
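As a partial answer to the question above, a sketch of the Arrow IPC streaming format, which is the usual Arrow-native way to serialize a sequence of record batches across a stream (pa.serialize, used in the benchmark, was later deprecated). Assumes standard pyarrow APIs:

    import pyarrow as pa

    # The IPC stream format carries one schema followed by any number of
    # record batches, which maps naturally onto micro-batched streams.
    batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        for _ in range(3):               # e.g. one batch per window/message
            writer.write_batch(batch)

    # A reader on the other side of the wire sees the same batches.
    reader = pa.ipc.open_stream(sink.getvalue())
    for received in reader:
        print(received.num_rows)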

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-09 Thread Tim Swast
pickle_deserialize = timeit.timeit(lambda: pickle.loads(serialized_obj), number=number)

[jira] [Created] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects

2019-03-19 Thread Tim Swast (JIRA)
Tim Swast created ARROW-4965: Summary: [Python] Timestamp array type detection should use tzname of datetime.datetime objects Key: ARROW-4965 URL: https://issues.apache.org/jira/browse/ARROW-4965 Project
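A sketch of the inference this ticket targets, with a tz-aware datetime (the values here are arbitrary): the Arrow type inferred from such objects should carry the timezone rather than dropping it.

    import datetime
    import pyarrow as pa

    # Type inference from tz-aware datetime objects: the ticket asks that
    # the inferred timestamp type pick up the tzinfo/tzname.
    ts = datetime.datetime(2019, 3, 19, 12, 0, tzinfo=datetime.timezone.utc)
    arr = pa.array([ts])
    print(arr.type)  # expected: timestamp[us, tz=UTC] once tzname is used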