To start, you don't need to share data. Just give the redacted schema with 
pointers to the data structures you think may contain the bug. Then we can read 
the relevant parts of the code for potential bugs. 


     On Tuesday, October 27, 2015 3:01 PM, web user <webuser1...@gmail.com> 
wrote:
   

 Python version 2. I have an Avro binary file. I'm not sure how to go from the 
"bad" version to something with redacted names, since I can't read it in 
Python to begin with...


On Tue, Oct 27, 2015 at 2:56 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:

Are you using version 2 or 3 of python avro? For a redacted schema, just give 
the schema with all field names and namespaces changed. If the schema is really 
long and complicated, you could just give the part that you suspect is causing 
issues.
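
Concretely, a redacted schema might look like the sketch below. All names here 
are placeholders (not from the file under discussion); the point is that the 
structure — record nesting, the array field, the types — is preserved so the 
failing shape can still be reproduced:

```python
import json

# Hypothetical redacted schema: field names and the namespace are
# placeholders, but the structure (record nesting, array field) is
# kept intact so others can reason about the bug.
redacted_schema = json.loads("""
{
  "type": "record",
  "name": "RecordA",
  "namespace": "ns.redacted",
  "fields": [
    {"name": "field1", "type": "string"},
    {"name": "field2", "type": {"type": "array", "items": "long"}}
  ]
}
""")

# Point reviewers at the suspect structure, e.g. the array field:
print(redacted_schema["fields"][1]["name"])
```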

Sam

 


     On Tuesday, October 27, 2015 1:42 PM, web user <webuser1...@gmail.com> 
wrote:
   

 No, I don't think that's the problem. The same code has worked for reading 
many, many files. This particular file hits a corner case where one of the data 
structures has no records in it, and that is causing a lot of grief for the 
Python Avro routine. The file was generated by the C++ Avro routines...
Regards,
WU
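
For reference, an empty data structure is legal in the Avro binary format: per 
the spec, an array is written as a series of blocks, each prefixed by a 
zigzag-varint count and terminated by a zero count, so an empty array 
serializes as the single byte 0x00. A minimal sketch of the count encoding (a 
re-implementation for illustration, following the spec rather than any 
library's exact code):

```python
def encode_long(v):
    """Encode a long as an Avro zigzag varint (per the Avro spec)."""
    n = (v << 1) ^ (v >> 63)   # zigzag: small magnitudes -> short encodings
    out = bytearray()
    while True:
        if n & ~0x7F:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        else:
            out.append(n)
            return bytes(out)

# An empty array is just a zero block count: one 0x00 byte.
print(encode_long(0))  # b'\x00'
```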
On Tue, Oct 27, 2015 at 2:38 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:

I think you may be missing a "return" when you create your DataFileReader. I 
have always been able to read data in Python using the standard methods, so I 
don't think there is a problem with the implementation. That said, the Python 
implementation is significantly slower than Java or C.
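
The pitfall here is plain Python, independent of Avro: a helper that constructs 
an object but never returns it hands the caller None, and the subsequent 
`with ... as reader:` then fails. A minimal sketch with ordinary file objects 
standing in for DataFileReader (the names are illustrative):

```python
import os
import tempfile

def open_broken(path):
    open(path)            # object is created, then silently discarded

def open_fixed(path):
    return open(path)     # caller actually receives the object

# Demonstrate on a throwaway file:
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    assert open_broken(path) is None   # 'with open_broken(...)' would fail
    f = open_fixed(path)
    assert f is not None
    f.close()
finally:
    os.remove(path)
```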

Sam 


     On Tuesday, October 27, 2015 1:23 PM, web user <webuser1...@gmail.com> 
wrote:
   

 Unfortunately the company I work at has a strict policy about sharing data. 
Having said that, I don't think the file is corrupted. 

I ran the following command:

java -jar avro-tools-1.7.7.jar tojson testdata.avro

and it generated a 1-byte output file.

I also ran java -jar avro-tools-1.7.7.jar getschema testdata.avro, and it 
returned the correct schema. 

Is there any way, when using the Python library, to keep it from consuming all 
the memory on the box?

Regards,

WU



On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <bus...@cloudera.com> wrote:

It sounds like the file you are reading is malformed. Could you share
the file or how it was written?

On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1...@gmail.com> wrote:
> I ran this in a vm with much less memory and it immediately failed with a
> memory error:
>
> Traceback (most recent call last):
>   File "testavro.py", line 31, in <module>
>     for r in reader:
>   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362,
> in next
>     datum = self.datum_reader.read(self.datum_decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in
> read
>     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in
> read_data
>     return self.read_record(writers_schema, readers_schema, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in
> read_record
>     field_val = self.read_data(field.type, readers_field.type, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in
> read_data
>     return self.read_array(writers_schema, readers_schema, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in
> read_array
>     for i in range(block_count):
> MemoryError
>
>
> On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm doing the following:
>>
>> from avro.datafile import DataFileReader
>> from avro.datafile import DataFileWriter
>> from avro.io import DatumReader
>> from avro.io import DatumWriter
>>
>> def OpenAvroFileToRead(avro_filename):
>>    DataFileReader(open(avro_filename, 'r'), DatumReader())
>>
>>
>> with OpenAvroFileToRead(avro_filename) as reader:
>>    for r in reader:
>>        ....
>>
>> I have an avro file which is only 500 bytes. I think there is a data
>> structure in there which is null or empty.
>>
>> I put in print statements before and after "for r in reader". On that
>> instruction, it consumes about 400 gigs of memory before I have to kill the
>> process.
>>
>> That is 400 gigs! I have 1 TB on my server. I have tried this with 1.6.1,
>> 1.7.1, and 1.7.7 and get the same behavior on all three versions.
>>
>> Any ideas on what is causing this?
>>
>> Regards,
>>
>> WU
>
>



--
Sean
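
A plausible mechanism for the traceback above: Avro arrays are written in 
blocks, each prefixed by a zigzag-varint count, and `read_array` loops 
`for i in range(block_count)`. If the decoder gets misaligned on corrupt or 
unexpected bytes, that count can decode to billions, and on Python 2 `range()` 
materializes a full list of that size — which would match the runaway memory 
use. A sketch of the count decoding (a re-implementation for illustration, 
following the Avro spec rather than the library's exact code):

```python
def decode_long(data):
    """Decode an Avro zigzag varint from the front of a byte string."""
    b = data[0]
    n = b & 0x7F
    shift = 7
    i = 1
    while b & 0x80:             # high bit set -> another byte follows
        b = data[i]
        n |= (b & 0x7F) << shift
        shift += 7
        i += 1
    return (n >> 1) ^ -(n & 1)  # undo zigzag

print(decode_long(b'\x04'))                  # a sane block count: 2
print(decode_long(b'\xfe\xff\xff\xff\x0f'))  # garbage bytes: 2147483647
```

On Python 2, `range(2147483647)` alone tries to allocate a list of about two 
billion ints — many gigabytes before a single record is read.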
