I'd like to figure out what the next steps should be. Options:

- Get BigQuery to output the currently expected Python objects in Ibis,
- Change Ibis to expect more Arrow-aligned types for complex types, or
- Update the Ibis tests to accept either Python objects or the output of
  Arrow (a sketch of the kind of value involved follows below).
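To make the second and third options concrete, here is a minimal sketch (my own illustration, not from the thread) of how a complex BigQuery type could surface through pyarrow: a STRUCT column arrives as a StructArray, and to_pylist() hands back plain Python dicts, which is the shape the Ibis tests would have to accept.

import pyarrow as pa

# Hypothetical STRUCT column as it might arrive from the BigQuery Storage API.
structs = pa.array(
    [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}],
    type=pa.struct([("x", pa.int64()), ("y", pa.string())]),
)

# Arrow's Python conversion yields plain dicts; Ibis tests written against
# another client library's output may expect a different shape.
print(structs.to_pylist())  # [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]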
FWIW, I found the Column class confusing in Python. It felt redundant /
unneeded for actually creating Tables.
On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney wrote:
> On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou wrote:
> > On 08/07/2019 at 23:17, Wes McKinney wrote:
> > > I'm
> Let me know if you want to collaborate on it.
Thanks Micah.
What are your thoughts on reading schemaless Avro bytes? One of the reasons
I started experimenting with the fork is that fastavro had trouble reading
more than one row at a time with its schemaless reader (sketched below).
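To illustrate what I mean, a minimal sketch (the schema and payload are placeholders of my own, not from the thread): fastavro's schemaless_reader decodes exactly one record per call, so consuming a buffer of concatenated schemaless rows means looping and tracking the buffer position yourself.

import io
import fastavro

# Placeholder writer schema; in the BigQuery case it arrives out of band.
schema = {
    "type": "record",
    "name": "Row",
    "fields": [{"name": "id", "type": "long"}],
}

payload = b"..."  # back-to-back schemaless-encoded rows (placeholder bytes)

# schemaless_reader returns one record per call, so reading many rows
# requires a manual loop with an end-of-buffer check:
buf = io.BytesIO(payload)
records = []
while buf.tell() < len(payload):
    records.append(fastavro.schemaless_reader(buf, schema))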
> provide feedback should you choose to go
> down this route.
>
> Thanks
> Wes
>
> On Tue, Jun 11, 2019, 4:53 PM Tim Swast wrote:
>
> > Hi Arrow and Avro devs,
> >
> > I've been investigating some performance issues with the BigQuery Storage
> > API
"no" (as I suspect it is) and I don't contribute it now,
the package will be clearly identified as a fork of the Apache Avro project
and licensed Apache 2.0, so it should be easy to pull in once the
techniques are proven.
Tim Swast
Software Friendliness Engineer
Google Cloud Developer Relations
Seattle, WA, USA
Tim Swast created ARROW-5450:
Summary: [Python] TimestampArray.to_pylist() fails with
OverflowError: Python int too large to convert to C long
Key: ARROW-5450
URL: https://issues.apache.org/jira/browse/ARROW-5450
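A minimal sketch of the failure mode (my own values, not copied from the ticket):

import pyarrow as pa

# A timestamp value far outside datetime's representable range; converting
# back to Python objects overflows on affected versions.
arr = pa.array([2**63 - 1], type=pa.timestamp("us"))
arr.to_pylist()  # raises OverflowError on affected versions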
Is there an option to mark fields as required / non-nullable in Parquet files?
If there is, is there a way to set that option with
pyarrow.parquet.write_table?
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
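For what it's worth, here is the approach I would expect to work, as a sketch (the file name is a placeholder, and it assumes the schema's nullability flag is carried through to Parquet's "required" repetition):

import pyarrow as pa
import pyarrow.parquet as pq

# Declare the field non-nullable in the Arrow schema; write_table writes
# the table with its schema, so the column should come out as required.
schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),  # required
    pa.field("name", pa.string()),               # optional (nullable)
])
table = pa.table({"id": [1, 2], "name": ["a", None]}, schema=schema)
pq.write_table(table, "example.parquet")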
Tim Swast
Software Friendliness Engineer
Google Cloud Developer Relations
Seattle, WA, USA
import datetime
import os
import pickle
import timeit
from collections import OrderedDict

import matplotlib.pyplot as plt
import numpy as np
import pyarrow as pa

dir_path = '.'  # placeholder; defined earlier in the original script


def benchmark_deser(batches, number=10):
    # Header and setup lines reconstructed; the quoted message begins
    # mid-function.
    pickle_results = []
    arrow_results = []
    for obj_batch in batches:
        serialized_obj = pickle.dumps(obj_batch)
        pickle_deserialize = timeit.timeit(
            lambda: pickle.loads(serialized_obj), number=number)
        pickle_results.append(pickle_deserialize)
        serialized_obj = serialize_by_arrow_array(obj_batch)
        arrow_deserialize = timeit.timeit(
            lambda: pa.deserialize(serialized_obj), number=number)
        arrow_results.append(arrow_deserialize)
    return [pickle_results, arrow_results]


def serialize_by_arrow_array(obj_batch):
    arrow_arrays = [pa.array(record) if not isinstance(record, pa.Array)
                    else record for record in obj_batch]
    return pa.serialize(arrow_arrays).to_buffer()


plot_dir = '{}/{}'.format(dir_path,
                          datetime.datetime.now().strftime('%m%d_%H%M_%S'))
if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)


def plot_time(pickle_times, arrow_times, batch_sizes, title, filename):
    fig, ax = plt.subplots()
    fig.set_size_inches(10, 8)

    bar_width = 0.35
    n_groups = len(batch_sizes)
    index = np.arange(n_groups)
    opacity = 0.6

    plt.bar(index, pickle_times, bar_width,
            alpha=opacity, color='r', label='Pickle')
    plt.bar(index + bar_width, arrow_times, bar_width,
            alpha=opacity, color='c', label='Arrow')

    plt.title(title, fontweight='bold')
    plt.ylabel('Time (seconds)', fontsize=10)
    plt.xticks(index + bar_width / 2, batch_sizes, fontsize=10)
    plt.legend(fontsize=10, bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.yticks(fontsize=10)
    plt.savefig(plot_dir + '/plot-' + filename + '.png', format='png')


def plot_size(pickle_sizes, arrow_sizes, batch_sizes, title, filename):
    fig, ax = plt.subplots()
    fig.set_size_inches(10, 8)

    bar_width = 0.35
    n_groups = len(batch_sizes)
    index = np.arange(n_groups)
    opacity = 0.6

    plt.bar(index, pickle_sizes, bar_width,
            alpha=opacity, color='r', label='Pickle')
    plt.bar(index + bar_width, arrow_sizes, bar_width,
            alpha=opacity, color='c', label='Arrow')

    plt.title(title, fontweight='bold')
    plt.ylabel('Space (Byte)', fontsize=10)
    plt.xticks(index + bar_width / 2, batch_sizes, fontsize=10)
    plt.legend(fontsize=10, bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.yticks(fontsize=10)
    plt.savefig(plot_dir + '/plot-' + filename + '.png', format='png')


def get_union_obj():
    size = 200
    str_array = pa.array(['str-' + str(i) for i in range(size)])
    int_array = pa.array(np.random.randn(size).tolist())
    types = pa.array([0 for _ in range(size)] + [1 for _ in range(size)],
                     type=pa.int8())
    offsets = pa.array(list(range(size)) + list(range(size)),
                       type=pa.int32())
    union_arr = pa.UnionArray.from_dense(types, offsets,
                                         [str_array, int_array])
    return union_arr


test_objects_generater = [
    lambda: np.random.randn(500),
    lambda: np.random.randn(500).tolist(),
    lambda: get_union_obj(),
]

titles = [
    'numpy arrays',
    'list of ints',
    'union array of strings and ints',
]


def plot_benchmark():
    batch_sizes = list(OrderedDict.fromkeys(
        int(i) for i in np.geomspace(1, 1000, num=25)))
    for i in range(len(test_objects_generater)):
        batches = [[test_objects_generater[i]() for _ in range(batch_size)]
                   for batch_size in batch_sizes]
        # benchmark_ser is referenced here but was not included in the quote.
        ser_result = benchmark_ser(batches=batches)
        plot_time(*ser_result[0:2], batch_sizes,
                  'serialization: ' + titles[i], 'ser_time' + str(i))
        plot_size(*ser_result[2:], batch_sizes,
                  'serialization byte size: ' + titles[i],
                  'ser_size' + str(i))
        deser = benchmark_deser(batches=batches)
        plot_time(*deser, batch_sizes,
                  'deserialization: ' + titles[i], 'deser_time-' + str(i))


if __name__ == "__main__":
    plot_benchmark()


Question

So if I want to use Arrow as the data serialization framework in
distributed stream data processing, what's the right way? Since stream
processing is a widespread scenario in data processing, frameworks such as
Flink and Spark Structured Streaming are becoming more and more popular. Is
there a possibility to add special support for stream processing in Arrow,
so that we can also benefit from the cross-language and efficient memory
layout?
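For reference, and as my own sketch rather than anything settled in this thread: Arrow's IPC stream format already gives a cross-language way to ship a sequence of record batches, which covers part of the streaming use case (the schema and batch contents here are placeholders):

import pyarrow as pa

# Producer: write a stream of record batches to a buffer.
schema = pa.schema([("value", pa.int64())])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    for i in range(3):
        writer.write_batch(
            pa.record_batch([pa.array([i, i + 1])], schema=schema))

# Consumer: read the batches back as they arrive.
with pa.ipc.open_stream(sink.getvalue()) as reader:
    for batch in reader:
        print(batch.num_rows)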
--
Tim Swast
Software Friendliness Engineer
Google Cloud Developer Relations
Seattle, WA, USA
Tim Swast created ARROW-4965:
Summary: [Python] Timestamp array type detection should use tzname
of datetime.datetime objects
Key: ARROW-4965
URL: https://issues.apache.org/jira/browse/ARROW-4965
Project: Apache Arrow
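The gist, as a sketch (example values are mine, not from the ticket):

import datetime
import pyarrow as pa

# Type inference from timezone-aware datetimes should derive the Arrow
# timestamp's tz from tzinfo.tzname() instead of discarding it.
dt = datetime.datetime(2019, 3, 1, 12, 0, tzinfo=datetime.timezone.utc)
arr = pa.array([dt])
print(arr.type)  # expected: timestamp('us', tz='UTC')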