Hi,

This is most likely related to this issue: https://issues.apache.org/jira/browse/AVRO-1364. It was fixed in Avro 1.7.6, so first try updating your Avro-C library.
-Mika

On Jan 27, 2014, at 6:35 PM, Amrith Kumar <amr...@parelastic.com> wrote:

> Here is some additional debugging information …
>
> I created this simple CSV file:
>
> ubuntu@petest1:/mnt/avrotest$ head maketest.csv
> "data1", "data2",
> 0, 1804289383,
> 1, 846930886,
> 2, 1681692777,
> 3, 1714636915,
> 4, 1957747793,
> 5, 424238335,
> 6, 719885386,
> 7, 1649760492,
> 8, 596516649,
> ubuntu@petest1:/mnt/avrotest$ tail maketest.csv
> 499990, 1910331393,
> 499991, 1091319779,
> 499992, 805782879,
> 499993, 1636478990,
> 499994, 1827956658,
> 499995, 1695362021,
> 499996, 1235853180,
> 499997, 208721086,
> 499998, 1836333752,
> 499999, 699496062,
>
> Nothing fancy, just 500,000 rows of data with the row number in the first
> column and a random integer in the second.
>
> Here is the Avro conversion:
>
> ubuntu@petest1:/mnt/avrotest$ csvtoavro -i maketest.csv -o maketest.avro
> 2014-01-27 11:28:40 csvtoavro: Processed maketest.csv with 500001 rows of data
>
> Since the header row also gets counted, it says 500,001.
>
> Now, here is the output from avrocat:
>
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | head -n 10
> {"data1": "0", "data2": " 1804289383"}
> {"data1": "1", "data2": " 846930886"}
> {"data1": "2", "data2": " 1681692777"}
> {"data1": "3", "data2": " 1714636915"}
> {"data1": "4", "data2": " 1957747793"}
> {"data1": "5", "data2": " 424238335"}
> {"data1": "6", "data2": " 719885386"}
> {"data1": "7", "data2": " 1649760492"}
> {"data1": "8", "data2": " 596516649"}
> {"data1": "9", "data2": " 1189641421"}
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | tail -n 10
> {"data1": "499944", "data2": " 929606694"}
> {"data1": "499945", "data2": " 973636875"}
> {"data1": "499946", "data2": " 1942285618"}
> {"data1": "499947", "data2": " 2089133167"}
> {"data1": "499948", "data2": " 213614747"}
> {"data1": "499949", "data2": " 599060422"}
> {"data1": "499950", "data2": " 1885053377"}
> {"data1": "499951", "data2": " 2100042242"}
> {"data1": "499952", "data2": " 1491280709"}
> {"data1": "499953", "data2": " 1103081139"}
> ubuntu@petest1:/mnt/avrotest$ ./maketest.avro
> ./maketest.avro 499954
>
> For completeness, here is some data from the CSV file showing values near
> where the Avro file appears to end.
>
> 499940, 1054581755,
> 499941, 600032353,
> 499942, 1997078786,
> 499943, 1508121989,
> 499944, 929606694,
> 499945, 973636875,
> 499946, 1942285618,
> 499947, 2089133167,
> 499948, 213614747,
> 499949, 599060422,
> 499950, 1885053377,
> 499951, 2100042242,
> 499952, 1491280709,
> 499953, 1103081139,
> 499954, 521709408,
> 499955, 494574550,
> 499956, 756884387,
> 499957, 2035729858,
> 499958, 1560742697,
> 499959, 923330093,
>
> In other words, the last 46 rows of data appear to be missing.
>
> -amrith
>
> From: Amrith Kumar [mailto:amr...@parelastic.com]
> Sent: Monday, January 27, 2014 11:23 AM
> To: user@avro.apache.org
> Subject: data missing in writing an AVRO file
>
> Greetings,
>
> I’m attempting to convert some very large CSV files into Avro format. To this
> end, I wrote a csvtoavro converter using the C API v1.7.5.
>
> The essence of the program is this:
>
> // initialize line counter
> lineno = 0;
>
> // make a schema first
> avro_schema_from_json_length (...);
>
> // make a generic class from the schema
> iface = avro_generic_class_from_schema (schema);
>
> // get the record size and verify that it is 109
> avro_schema_record_size (schema);
>
> // get a generic value
> avro_generic_value_new (iface, &tuple);
>
> // make me an output file
> fp = fopen (outputfile, "wb");
>
> // make me a file writer
> avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);
>
> // now for the code to emit the data
> while (...)
> {
>     avro_value_reset (&tuple);
>
>     // get the CSV record into the tuple
>     ...
>
>     // write that tuple
>     avro_file_writer_append_value (db, &tuple);
>
>     lineno++;
>
>     // flush the file
>     avro_file_writer_flush (db);
> }
>
> // close the output file
> avro_file_writer_close (db);
>
> // other cleanup
> avro_value_iface_decref (iface);
> avro_value_decref (&tuple);
>
> // close output file
> fflush (outfp);
> fclose (outfp);
>
> I read the file using a modified version of avrocat.c that looks like this:
>
> wschema = avro_file_reader_get_writer_schema (reader);
> iface = avro_generic_class_from_schema (wschema);
> avro_generic_value_new (iface, &value);
>
> int rval;
> lineno = 0;
>
> while ((rval = avro_file_reader_read_value (reader, &value)) == 0) {
>     lineno++;
>     avro_value_reset (&value);
> }
>
> // If it was not an EOF that caused it to fail, print the error.
> if (rval != EOF) {
>     fprintf (stderr, "Error: %s\n", avro_strerror ());
> } else {
>     printf ("%s %lld\n", filename, lineno);
> }
>
> On many files, I find no data is missing in the .avro file. However, quite
> often I get files where several dozen rows of data are missing.
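One thing worth ruling out independently of the library version: none of the writer calls in the pseudocode above check return codes, so a failed append, flush, or close would go unnoticed. Below is a non-runnable sketch of the same write loop with error checks added; it follows the email's own pseudocode (the `...` placeholders are kept), and `die()` is an illustrative helper, not part of the Avro API.

```
// Hypothetical error-checked version of the write loop above.
// die() is an illustrative helper, not an Avro API function.
static void die (const char *what)
{
    fprintf (stderr, "%s failed: %s\n", what, avro_strerror ());
    exit (1);
}

while (...)
{
    avro_value_reset (&tuple);

    // get the CSV record into the tuple
    ...

    if (avro_file_writer_append_value (db, &tuple) != 0)
        die ("avro_file_writer_append_value");

    lineno++;
}

// closing the writer should flush any buffered final block,
// so the per-row avro_file_writer_flush() call may be unnecessary
if (avro_file_writer_close (db) != 0)
    die ("avro_file_writer_close");
```

If any of these calls reports an error near the end of the file, that would point at the writer rather than the reader; if they all succeed and rows are still missing, that is consistent with the flushing bug Mika linked above.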
>
> I’m certain that I’m doing something wrong, and something very basic. Any
> help debugging would be most appreciated.
>
> Thanks,
>
> -amrith