Both data and data2 have no data. When using the tojson tool from the Java implementation I get a file with one byte. The original avro file is only about 500 bytes, which is probably mostly just the schema.
On Tue, Oct 27, 2015 at 4:33 PM, web user <webuser1...@gmail.com> wrote:

> Was the dump earlier not helpful? That identifies the exact spot where the
> memory exception was happening.
>
> Here is the schema with the names changed:
>
> {
>   "type" : "record",
>   "name" : "SomeName",
>   "namespace" : "com.somenamespace",
>   "fields" : [ {
>     "name" : "data",
>     "type" : {
>       "type" : "array",
>       "items" : "bytes"
>     }
>   }, {
>     "name" : "data2",
>     "type" : {
>       "type" : "array",
>       "items" : "bytes"
>     }
>   } ]
> }
>
> On Tue, Oct 27, 2015 at 4:28 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:
>
>> To start out, you don't need to give data. Just the redacted schema with
>> pointers to the data structures you think may have the bug. Then we could
>> read specific parts of the code for potential bugs.
>>
>> On Tuesday, October 27, 2015 3:01 PM, web user <webuser1...@gmail.com> wrote:
>>
>> Python version 2. I have an avro binary file. I'm not sure how to go from
>> the "bad" version to something with redacted names, since I can't read it
>> in python to begin with...
>>
>> On Tue, Oct 27, 2015 at 2:56 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:
>>
>> Are you using version 2 or 3 of python avro? For a redacted schema, just
>> give the schema with all field names and namespaces changed. If the schema
>> is really long and complicated, you could just give the part that you
>> suspect is causing issues.
>>
>> Sam
>>
>> On Tuesday, October 27, 2015 1:42 PM, web user <webuser1...@gmail.com> wrote:
>>
>> No. I don't think the problem is that. The same code has worked for
>> reading many, many files. This particular file hit a corner case where one
>> of the data structures has no records in it, and it is causing a lot of
>> grief to the python avro routine. It's been generated from C++ avro
>> routines...
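For context on why an empty array could be a corner case: the Avro spec encodes arrays as a sequence of counted blocks, each starting with a zig-zag variable-length long, and an empty array is a single zero count (one 0x00 byte). If the decoder reads the count from the wrong position in the stream, it can come out astronomically large. A minimal sketch of the count decoding (the helper name is mine, not the avro library's):

```python
import io

def read_zigzag_long(stream):
    """Decode one Avro variable-length zig-zag long (illustrative helper)."""
    shift = 0
    accum = 0
    while True:
        b = stream.read(1)[0]
        accum |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return (accum >> 1) ^ -(accum & 1)  # undo the zig-zag mapping

# An empty array is a single block whose count is 0: exactly one 0x00 byte.
print(read_zigzag_long(io.BytesIO(b"\x00")))  # 0

# But if the reader lands on the wrong bytes, the "block count" can be huge:
print(read_zigzag_long(io.BytesIO(b"\xfe\xff\xff\xff\x1f")))  # 4294967295
```

A bogus multi-billion block count would then drive the array-reading loop, which matches the traceback later in this thread.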
>> Regards,
>>
>> WU
>>
>> On Tue, Oct 27, 2015 at 2:38 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:
>>
>> I think you may be missing a "return" when you create your
>> DataFileReader. I have always been able to read data in python using the
>> standard methods, so I don't think there is a problem with the
>> implementation. That said, the python implementation is significantly
>> slower than Java or C.
>>
>> Sam
>>
>> On Tuesday, October 27, 2015 1:23 PM, web user <webuser1...@gmail.com> wrote:
>>
>> Unfortunately the company I work at has a strict policy about sharing
>> data. Having said that, I don't think the file is corrupted.
>>
>> I ran the following command:
>>
>>     java -jar avro-tools-1.7.7.jar tojson testdata.avro
>>
>> and it generates a file of 1 byte.
>>
>> I also ran
>>
>>     java -jar avro-tools-1.7.7.jar getschema testdata.avro
>>
>> and it gets back the correct schema.
>>
>> Is there any way, when using the python library, for it not to consume
>> all memory on the entire box?
>>
>> Regards,
>>
>> WU
>>
>> On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <bus...@cloudera.com> wrote:
>>
>> It sounds like the file you are reading is malformed. Could you share
>> the file or how it was written?
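Sam's point about the missing "return" is easy to demonstrate with a tiny stand-in class (FakeReader and both helper names are illustrative, not part of the avro API): a function that constructs the reader but never returns it hands None to the with statement, which then fails before any reading happens.

```python
class FakeReader:
    """Minimal stand-in for DataFileReader, for illustration only."""
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def open_reader_buggy(path):
    FakeReader()          # bug: the reader is created, then discarded

def open_reader_fixed(path):
    return FakeReader()   # fix: hand the reader back to the caller

print(open_reader_buggy("testdata.avro"))  # None
try:
    with open_reader_buggy("testdata.avro") as reader:
        pass
except (AttributeError, TypeError):
    print("with-statement fails: None is not a context manager")
```

Note this particular bug would make the `with` block crash immediately; it would not by itself explain the memory blow-up, which is why the malformed-count theory is also on the table.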
>> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1...@gmail.com> wrote:
>>
>> > I ran this in a vm with much less memory and it immediately failed with a
>> > memory error:
>> >
>> > Traceback (most recent call last):
>> >   File "testavro.py", line 31, in <module>
>> >     for r in reader:
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
>> >     datum = self.datum_reader.read(self.datum_decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
>> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
>> >     return self.read_record(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
>> >     field_val = self.read_data(field.type, readers_field.type, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
>> >     return self.read_array(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
>> >     for i in range(block_count):
>> > MemoryError
>> >
>> > On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing the following:
>> >>
>> >>     from avro.datafile import DataFileReader
>> >>     from avro.datafile import DataFileWriter
>> >>     from avro.io import DatumReader
>> >>     from avro.io import DatumWriter
>> >>
>> >>     def OpenAvroFileToRead(avro_filename):
>> >>         DataFileReader(open(avro_filename, 'r'), DatumReader())
>> >>
>> >>     with OpenAvroFileToRead(avro_filename) as reader:
>> >>         for r in reader:
>> >>             ....
>> >>
>> >> I have an avro file which is only 500 bytes. I think there is a data
>> >> structure in there which is null or empty.
>> >>
>> >> I put in print statements before and after "for r in reader".
>> >> On the instruction "for r in reader" it consumes about 400 Gigs of memory
>> >> before I have to kill the process.
>> >>
>> >> That is 400 Gigs! I have 1TB on my server. I have tried this with 1.6.1
>> >> and 1.7.1 and 1.7.7 and get the same behavior on all three versions.
>> >>
>> >> Any ideas on what is causing this?
>> >>
>> >> Regards,
>> >>
>> >> WU
>>
>> --
>> Sean
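The MemoryError surfaces at `for i in range(block_count)` in read_array, and one plausible contributor (my reading, not confirmed anywhere in this thread) is that Python 2's range() materializes the full list before the loop starts, so a garbage block count in the billions forces an enormous upfront allocation even though the file itself is 500 bytes. Rough arithmetic, where the ~32 bytes per entry (list slot plus boxed int on a 64-bit CPython 2 build) is an assumed estimate:

```python
# Back-of-the-envelope memory for a materialized Python 2 range() list.
# xrange (or Python 3's lazy range) avoids this upfront allocation.
BYTES_PER_ENTRY = 32  # assumed: ~8-byte list slot + ~24-byte boxed int

def py2_range_footprint_gb(block_count):
    """Estimated GB needed to build range(block_count) as a list on Python 2."""
    return block_count * BYTES_PER_ENTRY / 1e9

# A corrupt count near 2**32 entries:
print(round(py2_range_footprint_gb(4294967295)))  # ~137 GB for the list alone
```

That is the same order of magnitude as the hundreds of gigabytes observed, so a malformed or mis-read block count plus an eager range() is a consistent story; the missing "return" Sam spotted is a separate bug to fix in any case.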