Hi,

This is most likely related to AVRO-1364 
(https://issues.apache.org/jira/browse/AVRO-1364). It is fixed in Avro 1.7.6, 
so first try updating your Avro C library.
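
If upgrading is not immediately possible, it may also be worth trying the loop 
without the per-record avro_file_writer_flush() call, and checking the return 
code of every writer call, since a silently failed append would look exactly 
like missing rows. A sketch of that variant of your writer loop (untested, 
assumes the 1.7.x C API; CSV parsing elided as in your mail):

```c
/* Sketch only: same writer loop, but with no flush inside the loop
 * and with return codes checked. avro_file_writer_close() writes out
 * any still-buffered block before closing. */
while (/* more CSV rows */) {
    avro_value_reset (&tuple);

    /* ... fill the tuple from the CSV record ... */

    if (avro_file_writer_append_value (db, &tuple) != 0) {
        fprintf (stderr, "append failed at row %ld: %s\n",
                 lineno, avro_strerror());
        break;
    }
    lineno ++;
}

if (avro_file_writer_close (db) != 0)
    fprintf (stderr, "close failed: %s\n", avro_strerror());
```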

-Mika

On Jan 27, 2014, at 6:35 PM, Amrith Kumar <amr...@parelastic.com> wrote:

> Here is some additional debugging information …
>  
> I created this simple CSV file that looks thus.
>  
> ubuntu@petest1:/mnt/avrotest$ head maketest.csv
> "data1", "data2",
> 0, 1804289383,
> 1, 846930886,
> 2, 1681692777,
> 3, 1714636915,
> 4, 1957747793,
> 5, 424238335,
> 6, 719885386,
> 7, 1649760492,
> 8, 596516649,
> ubuntu@petest1:/mnt/avrotest$ tail maketest.csv
> 499990, 1910331393,
> 499991, 1091319779,
> 499992, 805782879,
> 499993, 1636478990,
> 499994, 1827956658,
> 499995, 1695362021,
> 499996, 1235853180,
> 499997, 208721086,
> 499998, 1836333752,
> 499999, 699496062,
>  
> Nothing fancy, just 500,000 rows of data with the row number in the first 
> column and some random integer in the second.
>  
> Here is the avro conversion.
>  
> ubuntu@petest1:/mnt/avrotest$ csvtoavro -i maketest.csv -o maketest.avro
> 2014-01-27 11:28:40  csvtoavro: Processed maketest.csv with 500001 rows of 
> data
>  
> Since there is a header row which gets counted it says 500,001.
>  
> Now, here is the output from avrocat
>  
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | head -n 10
> {"data1": "0", "data2": " 1804289383"}
> {"data1": "1", "data2": " 846930886"}
> {"data1": "2", "data2": " 1681692777"}
> {"data1": "3", "data2": " 1714636915"}
> {"data1": "4", "data2": " 1957747793"}
> {"data1": "5", "data2": " 424238335"}
> {"data1": "6", "data2": " 719885386"}
> {"data1": "7", "data2": " 1649760492"}
> {"data1": "8", "data2": " 596516649"}
> {"data1": "9", "data2": " 1189641421"}
> ubuntu@petest1:/mnt/avrotest$ avrocat ./maketest.avro | tail -n 10
> {"data1": "499944", "data2": " 929606694"}
> {"data1": "499945", "data2": " 973636875"}
> {"data1": "499946", "data2": " 1942285618"}
> {"data1": "499947", "data2": " 2089133167"}
> {"data1": "499948", "data2": " 213614747"}
> {"data1": "499949", "data2": " 599060422"}
> {"data1": "499950", "data2": " 1885053377"}
> {"data1": "499951", "data2": " 2100042242"}
> {"data1": "499952", "data2": " 1491280709"}
> {"data1": "499953", "data2": " 1103081139"}
> ubuntu@petest1:/mnt/avrotest$ ./maketest.avro
> ./maketest.avro 499954
>  
> For completeness, here is some data from the CSV file showing values around 
> where the AVRO file appears to end.
>  
> 499940, 1054581755,
> 499941, 600032353,
> 499942, 1997078786,
> 499943, 1508121989,
> 499944, 929606694,
> 499945, 973636875,
> 499946, 1942285618,
> 499947, 2089133167,
> 499948, 213614747,
> 499949, 599060422,
> 499950, 1885053377,
> 499951, 2100042242,
> 499952, 1491280709,
> 499953, 1103081139,
> 499954, 521709408,
> 499955, 494574550,
> 499956, 756884387,
> 499957, 2035729858,
> 499958, 1560742697,
> 499959, 923330093,
>  
> In other words, the last 46 rows of data appear to be missing.
>  
> -amrith
>  
> From: Amrith Kumar [mailto:amr...@parelastic.com] 
> Sent: Monday, January 27, 2014 11:23 AM
> To: user@avro.apache.org
> Subject: data missing in writing an AVRO file.
>  
> Greetings,
>  
> I’m attempting to convert some very large CSV files into AVRO format. To this 
> end, I wrote a csvtoavro converter using C API v1.7.5.
>  
> The essence of the program is this:
>  
> // initialize line counter
> lineno = 0;
>  
> // make a schema first
> avro_schema_from_json_length (...);
>  
> // make a generic class from schema
> iface = avro_generic_class_from_schema( schema );
>  
> // get the record size and verify that it is 109
> avro_schema_record_size (schema);
>  
> // get a generic value
> avro_generic_value_new (iface, &tuple);
>  
> // make me an output file
> fp = fopen ( outputfile, "wb" );
>  
> // make me a filewriter
> avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);
>  
> // now for the code to emit the data
>  
> while (...)
> {
>     avro_value_reset (&tuple);
>  
>     // get the CSV record into the tuple
>     ...
>  
>     // write that tuple
>     avro_file_writer_append_value (db, &tuple);
>  
>     lineno ++;
>  
>     // flush the file
>     avro_file_writer_flush (db);
> }
>  
> // close the output file
> avro_file_writer_close (db);
>  
> // other cleanup
> avro_value_iface_decref (iface);
> avro_value_decref (&tuple);
>  
> // flush and close output file
> fflush (fp);
> fclose (fp);
>  
> I read the file using a modified version of avrocat.c that looks like this.
>  
> wschema = avro_file_reader_get_writer_schema(reader);
> iface = avro_generic_class_from_schema(wschema);
> avro_generic_value_new(iface, &value);
>  
> int rval;
> lineno = 0;
>  
> while ((rval = avro_file_reader_read_value(reader, &value)) == 0) {
>     lineno ++;
>     avro_value_reset(&value);
> }
>  
> // If it was not an EOF that caused it to fail,
> // print the error.
> if (rval != EOF)
> {
>     fprintf(stderr, "Error: %s\n", avro_strerror());
> }
> else
> {
>     printf ( "%s %lld\n", filename, lineno );
> }
>  
> On many files, no data is missing from the .AVRO file. However, quite often 
> I get files where several dozen rows of data are missing.
>  
> I’m certain that I’m doing something wrong, and something very basic. Any 
> help debugging would be most appreciated.
>  
> Thanks,
>  
> -amrith
