[jira] [Updated] (HADOOP-10669) Avro serialization does not flush buffered serialized values causing data lost

2014-06-08 Thread Mikhail Bernadsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Bernadsky updated HADOOP-10669:
---

Attachment: HADOOP-10669_alt.patch

 Avro serialization does not flush buffered serialized values causing data lost
 --

 Key: HADOOP-10669
 URL: https://issues.apache.org/jira/browse/HADOOP-10669
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 2.4.0
Reporter: Mikhail Bernadsky
 Attachments: HADOOP-10669.patch, HADOOP-10669_alt.patch


 Found this while debugging Nutch.

 MapTask serializes keys and values to the same stream, in pairs:

 keySerializer.serialize(key);
 ...
 valSerializer.serialize(value);
 ...
 bb.write(b0, 0, 0);

 AvroSerializer does not flush its buffer after each serialization. So if it
 is used as valSerializer, the values are only partially written, or not
 written at all, to the output stream before the record is marked as complete
 (the last line above).
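
The failure mode described here can be reproduced without Hadoop or Avro. The following is a minimal sketch using only java.io: BufferingSerializer is a hypothetical stand-in for a serializer that keeps its own buffer (it is not the real AvroSerializer), and the ByteArrayOutputStream stands in for the shared MapTask stream. It counts how many bytes have reached the shared stream at the point where the record would be marked as complete.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class BufferedSerializerDemo {

    /** Hypothetical serializer with an internal buffer; not the real AvroSerializer. */
    static final class BufferingSerializer {
        private final OutputStream buffered;
        private final boolean flushAfterEach;

        BufferingSerializer(OutputStream shared, boolean flushAfterEach) {
            // Bytes sit in this 8 KiB buffer until it fills up or flush() is called.
            this.buffered = new BufferedOutputStream(shared, 8192);
            this.flushAfterEach = flushAfterEach;
        }

        void serialize(byte[] value) throws IOException {
            buffered.write(value);
            if (flushAfterEach) {
                buffered.flush(); // push the buffered bytes to the shared stream
            }
        }
    }

    /** Returns the number of bytes in the shared stream when the record would be marked complete. */
    static int bytesInStreamAtRecordEnd(boolean flushAfterEach) {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        BufferingSerializer valSerializer = new BufferingSerializer(shared, flushAfterEach);
        try {
            valSerializer.serialize(new byte[] {1, 2, 3, 4});
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The caller would mark the record as complete here.
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println("without flush: " + bytesInStreamAtRecordEnd(false)); // prints "without flush: 0"
        System.out.println("with flush: " + bytesInStreamAtRecordEnd(true));     // prints "with flush: 4"
    }
}
```

Without the flush, the four value bytes are still in the serializer's private buffer when the record boundary is recorded, which matches the "not written at all" case above. A value larger than the buffer would be only partially visible.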



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HADOOP-10669) Avro serialization does not flush buffered serialized values causing data lost

2014-06-08 Thread Mikhail Bernadsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Bernadsky updated HADOOP-10669:
---

Description: 
Found this while debugging Nutch.

MapTask serializes keys and values to the same stream, in pairs:

keySerializer.serialize(key);
...
valSerializer.serialize(value);
...
bb.write(b0, 0, 0);

AvroSerializer does not flush its buffer after each serialization. So if it is
used as valSerializer, the values are only partially written, or not written at
all, to the output stream before the record is marked as complete (the last line
above).

EDIT: Added HADOOP-10669_alt.patch. This is a less intrusive fix, as it does
not try to flush the MapTask stream. Instead, we write serialized values directly
to the MapTask stream and avoid using a buffer on the Avro side.
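
The idea behind the alternative fix can be illustrated with a small java.io-only sketch. DirectSerializer is a hypothetical class that holds no buffer of its own, so every byte is in the shared stream as soon as serialize() returns and nothing needs to be flushed. For reference, Avro's EncoderFactory offers both a buffered binaryEncoder and an unbuffered directBinaryEncoder; whether the attached patch relies on them is not stated in this message, so treat that mapping as an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class DirectWriteDemo {

    /** Hypothetical serializer that writes straight to the shared stream, with no buffer. */
    static final class DirectSerializer {
        private final OutputStream out;

        DirectSerializer(OutputStream shared) {
            this.out = shared;
        }

        void serialize(int value) throws IOException {
            // Encode as 4 big-endian bytes; each write lands in the shared stream immediately.
            out.write(value >>> 24);
            out.write(value >>> 16);
            out.write(value >>> 8);
            out.write(value);
        }
    }

    /** Writes one key/value pair and returns the shared stream's contents at record end. */
    static byte[] writeRecord(int key, int value) {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        DirectSerializer keySerializer = new DirectSerializer(shared);
        DirectSerializer valSerializer = new DirectSerializer(shared);
        try {
            keySerializer.serialize(key);
            valSerializer.serialize(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The record would be marked complete here; all 8 bytes are already in the stream.
        return shared.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeRecord(1, 2).length); // prints 8
    }
}
```

The trade-off is more, smaller writes to the underlying stream in exchange for not having to coordinate a flush with the code that owns that stream.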



 Avro serialization does not flush buffered serialized values causing data lost
 --

 Key: HADOOP-10669
 URL: https://issues.apache.org/jira/browse/HADOOP-10669
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 2.4.0
Reporter: Mikhail Bernadsky
 Attachments: HADOOP-10669.patch, HADOOP-10669_alt.patch







[jira] [Updated] (HADOOP-10669) Avro serialization does not flush buffered serialized values causing data lost

2014-06-07 Thread Mikhail Bernadsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Bernadsky updated HADOOP-10669:
---

Attachment: HADOOP-10669.patch

 Avro serialization does not flush buffered serialized values causing data lost
 --

 Key: HADOOP-10669
 URL: https://issues.apache.org/jira/browse/HADOOP-10669
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 2.4.0
Reporter: Mikhail Bernadsky
 Attachments: HADOOP-10669.patch




