well, testing with the java avro-tools was my very next suggestion. :/

Can you make a redacted version of the schema?
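
For what it's worth, the traceback points at the likely mechanism: read_array()
decodes a block count from the stream and then (in the Python 2 implementation)
does "for i in range(block_count)", which materializes a list of that size. If
a few bytes of the file are garbled, the zig-zag varint holding the count can
decode to something enormous. A minimal sketch of the decoding (the byte values
below are made up to illustrate, not taken from your file):

```python
def decode_zigzag_long(data):
    """Decode one Avro zig-zag varint -- the encoding Avro uses for
    longs, including array/map block counts."""
    n = 0
    shift = 0
    for b in data:
        n |= (b & 0x7F) << shift       # accumulate 7 payload bits per byte
        if not (b & 0x80):             # high bit clear = last byte
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)         # undo zig-zag: map back to signed

# Five stray bytes are enough to decode as a ~2-billion-element block:
print(decode_zigzag_long(bytes([0xFE, 0xFF, 0xFF, 0xFF, 0x0F])))
# prints 2147483647
```

So a garbled block header, rather than the 500 bytes of payload itself, would
explain the reader trying to allocate hundreds of gigs.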

On Tue, Oct 27, 2015 at 1:22 PM, web user <webuser1...@gmail.com> wrote:
> Unfortunately, the company I work at has a strict policy about sharing
> data. Having said that, I don't think the file is corrupted.
>
> I ran the following command:
>
> java -jar avro-tools-1.7.7.jar tojson testdata.avro
>
> and it generates a file of 1 byte
>
> I also ran java -jar avro-tools-1.7.7.jar getschema testdata.avro and it
> gets back the correct schema.
>
> Is there any way, when using the python library, to keep it from
> consuming all the memory on the box?
>
> Regards,
>
> WU
>
>
>
> On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <bus...@cloudera.com> wrote:
>>
>> It sounds like the file you are reading is malformed. Could you share
>> the file or how it was written?
>>
>> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1...@gmail.com> wrote:
>> > I ran this in a VM with much less memory and it immediately failed
>> > with a MemoryError:
>> >
>> > Traceback (most recent call last):
>> >   File "testavro.py", line 31, in <module>
>> >     for r in reader:
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
>> >     datum = self.datum_reader.read(self.datum_decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
>> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
>> >     return self.read_record(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
>> >     field_val = self.read_data(field.type, readers_field.type, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
>> >     return self.read_array(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
>> >     for i in range(block_count):
>> > MemoryError
>> >
>> >
>> > On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing the following:
>> >>
>> >> from avro.datafile import DataFileReader
>> >> from avro.datafile import DataFileWriter
>> >> from avro.io import DatumReader
>> >> from avro.io import DatumWriter
>> >>
>> >> def OpenAvroFileToRead(avro_filename):
>> >>    return DataFileReader(open(avro_filename, 'rb'), DatumReader())
>> >>
>> >>
>> >> with OpenAvroFileToRead(avro_filename) as reader:
>> >>    for r in reader:
>> >>        ....
>> >>
>> >> I have an avro file which is only 500 bytes. I think there is a data
>> >> structure in there which is null or empty.
>> >>
>> >> I put in print statements before and after "for r in reader". On that
>> >> instruction it consumes about 400 gigs of memory before I have to
>> >> kill the process.
>> >>
>> >> That is 400 gigs! I have 1TB on my server. I have tried this with
>> >> 1.6.1, 1.7.1, and 1.7.7 and get the same behavior on all three
>> >> versions.
>> >>
>> >> Any ideas on what is causing this?
>> >>
>> >> Regards,
>> >>
>> >> WU
>> >
>> >
>>
>>
>>
>> --
>> Sean
>
>
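
On the question of keeping the python reader from taking down the whole box:
I don't know of a knob in the library itself for that, but you can make the
process fail fast by capping its address space before opening the file. A
sketch (Linux; the 4 GiB cap is an arbitrary example value):

```python
import resource

# Cap this process's virtual address space so a runaway allocation
# raises MemoryError quickly instead of eating all the RAM on the box.
cap = 4 * 1024 ** 3  # 4 GiB -- example value, tune to your workload
resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
```

With that in place the bad read should die with a MemoryError (as it did in
your small VM) rather than grinding the server.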



-- 
Sean
