Well, testing with the Java avro-tools was my very next suggestion. :/ Can you make a redacted version of the schema?
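One mechanical way to redact a schema is to walk the schema JSON and replace identifying strings (`name`, `namespace`, `doc`) with placeholders while leaving the type structure intact, so the shape of the file can be shared without leaking anything. A minimal sketch; the helper and the toy schema are illustrative, not taken from the thread:

```python
import json

def redact(node, counter=None):
    # Recursively replace name/namespace/doc values with generic
    # placeholders, preserving the type structure of the schema.
    if counter is None:
        counter = {"n": 0}
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key in ("name", "namespace", "doc"):
                counter["n"] += 1
                out[key] = "redacted_%d" % counter["n"]
            else:
                out[key] = redact(value, counter)
        return out
    if isinstance(node, list):
        return [redact(item, counter) for item in node]
    return node

# Toy example, not the poster's real schema:
schema = {
    "type": "record", "name": "Account",
    "fields": [
        {"name": "balance", "type": "double"},
        {"name": "history", "type": {"type": "array", "items": "long"}},
    ],
}
print(json.dumps(redact(schema), indent=2))
```

The output keeps every `type` (record, array, primitive) exactly as written, which is what matters for diagnosing a decoding problem.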
On Tue, Oct 27, 2015 at 1:22 PM, web user <webuser1...@gmail.com> wrote:
> Unfortunately the company I work at has a strict policy about sharing data.
> Having said that, I don't think the file is corrupted.
>
> I ran the following command:
>
>     java -jar avro-tools-1.7.7.jar tojson testdata.avro
>
> and it generates a file of 1 byte.
>
> I also ran `java -jar avro-tools-1.7.7.jar getschema testdata.avro` and it
> gets back the correct schema.
>
> Is there any way, when using the Python library, for it not to consume all
> the memory on the entire box?
>
> Regards,
>
> WU
>
> On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <bus...@cloudera.com> wrote:
>>
>> It sounds like the file you are reading is malformed. Could you share
>> the file or how it was written?
>>
>> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1...@gmail.com> wrote:
>>> I ran this in a VM with much less memory and it immediately failed with
>>> a memory error:
>>>
>>> Traceback (most recent call last):
>>>   File "testavro.py", line 31, in <module>
>>>     for r in reader:
>>>   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
>>>     datum = self.datum_reader.read(self.datum_decoder)
>>>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
>>>     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>>>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
>>>     return self.read_record(writers_schema, readers_schema, decoder)
>>>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
>>>     field_val = self.read_data(field.type, readers_field.type, decoder)
>>>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
>>>     return self.read_array(writers_schema, readers_schema, decoder)
>>>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
>>>     for i in range(block_count):
>>> MemoryError
>>>
>>> On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm doing the following:
>>>>
>>>>     from avro.datafile import DataFileReader
>>>>     from avro.datafile import DataFileWriter
>>>>     from avro.io import DatumReader
>>>>     from avro.io import DatumWriter
>>>>
>>>>     def OpenAvroFileToRead(avro_filename):
>>>>         return DataFileReader(open(avro_filename, 'r'), DatumReader())
>>>>
>>>>     with OpenAvroFileToRead(avro_filename) as reader:
>>>>         for r in reader:
>>>>             ....
>>>>
>>>> I have an Avro file which is only 500 bytes. I think there is a data
>>>> structure in there which is null or empty.
>>>>
>>>> I put in print statements before and after "for r in reader". On that
>>>> instruction, it consumes about 400 GB of memory before I have to kill
>>>> the process.
>>>>
>>>> That is 400 GB! I have 1 TB on my server. I have tried this with 1.6.1,
>>>> 1.7.1, and 1.7.7 and get the same behavior on all three versions.
>>>>
>>>> Any ideas on what is causing this?
>>>>
>>>> Regards,
>>>>
>>>> WU
>>
>> --
>> Sean

--
Sean
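The traceback above ends in `read_array`, where the decoder reads the array block count as a variable-length zigzag long and then, under Python 2, `range(block_count)` materializes a full list of that size. A few garbage bytes at that position can decode to a count in the trillions, which would match the 400 GB blow-up described. A minimal sketch of the varint decoding involved, with hand-built byte sequences (not bytes from the poster's file):

```python
def decode_zigzag_long(data):
    # Variable-length zigzag decoding as used by the Avro binary format:
    # each byte contributes 7 payload bits; the high bit flags continuation.
    n, shift = 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    # Undo the zigzag interleaving of positive and negative values.
    return (n >> 1) ^ -(n & 1)

# A well-formed block count of 2 is a single byte:
print(decode_zigzag_long(bytes([0x04])))  # -> 2

# But if the reader lands on garbage (stray continuation bytes), the
# "count" explodes; Python 2's range() would then try to build a list
# with trillions of entries, exhausting memory before any data is read:
print(decode_zigzag_long(bytes([0xFE, 0xFF, 0xFF, 0xFF, 0xFF, 0x7F])))
```

This is why a 500-byte file can trigger the failure: the allocation is driven by the decoded count, not by how many bytes actually follow it. Validating the file with the Java avro-tools `tojson`, as suggested above, is a good cross-check, since the Java implementation reads blocks incrementally.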