Re: UTF8 and Parquet

Daniel St. John Thu, 26 Mar 2015 12:38:24 -0700

Hello,

I updated to the latest versions of everything in the Parquet ecosystem, and 
the annotations in the message schema now come through when reading the 
Parquet file, so please disregard my previous message.


Question: Can I open a Parquet file with an instance of FSDataInputStream 
instead of Path?

What I have done was inspired by the CSV to Parquet example on GitHub. We are 
using Parquet as storage for our proprietary record format. We are also 
reading existing Parquet files and translating them to our proprietary record 
format. In short, I open a Parquet file with File or Path, query the footer 
for the message type, derive our Record schema from it using the extra 
annotation info, and then read the data from Parquet into our record format 
one record at a time. It is working well: records go in and come out, and the 
data checks out. Here is a summary of what I am doing:

1> The message schema:

Path parquetFilePath = ..

ParquetMetadata readFooter = ParquetFileReader.readFooter(configuration, parquetFilePath);
MessageType schema = readFooter.getFileMetaData().getSchema();



2> Then a reader :


Path path = ..

GroupReadSupport readSupport = new GroupReadSupport();
// schemaParquet is the MessageType read from the footer in step 1
readSupport.init(configuration, null, schemaParquet);

ParquetReader<Group> reader;
try {
    reader = new ParquetReader<Group>(path, readSupport);
} catch (IOException e) {
    // pass the exception to the logger so the stack trace is recorded once
    LOG.error("We can not create Parquet Reader", e);
    throw new ReadParquestFileException(e);
}

3> Get Data sequentially:


Group group;

// my record
Record dmRecord =..
…

// is there another group
if ((group = reader.read()) != null) {
    for (int index = 0; index < MY_RECORD_LENGTH; index++) {
        // stuff with data
        GroupType groupType = group.getType();
        String fieldName = groupType.getFieldName(index);
        Type type = groupType.getType(index);

        if (type.isPrimitive()) {
            PrimitiveType pt = (PrimitiveType) type;
            PrimitiveTypeName ptn = pt.getPrimitiveTypeName();
            // name of the Group accessor for this primitive, e.g. "getLong"
            String method = ptn.getMethod;
            String primitiveName = pt.getName();
            OriginalType originalType = type.getOriginalType();

            switch (method) {
            case "getBoolean":
                Boolean valueBoolean = group.getBoolean(index, 0);
                dmRecord.set(index, valueBoolean);
                break;
            case "getFloat":
                Float valueFloat = group.getFloat(index, 0);
                // the value was read but never stored in the original snippet
                dmRecord.set(index, valueFloat);
                break;
            case "getDouble":
                // doubles are stored in the record as their string form
                String valueToString = group.getValueToString(index, 0);
                dmRecord.set(index, valueToString);
                break;
            case "getLong":
                Long valueLong = group.getLong(index, 0);
                dmRecord.set(index, valueLong);
                break;
            case "getBinary":
                Binary valueBinary = group.getBinary(index, 0);
                LOG.info("value(Binary):" + valueBinary.toString());
                if (originalType == OriginalType.UTF8) {
                    // UTF8-annotated binary: decode to text
                    LOG.info("We have a String");
                    byte[] bytes = valueBinary.getBytes();
                    String valueToStringUTF8 = new String(bytes, "UTF-8");
                    dmRecord.set(index, valueToStringUTF8);
                } else {
                    // unannotated binary: keep the raw bytes
                    dmRecord.set(index, valueBinary.getBytes());
                }
                break;
            default:
                // any other primitive: fall back to the string form
                String defaultValue = group.getValueToString(index, 0);
                dmRecord.set(index, defaultValue);
                break;
            }
        }
    }
}

return dmRecord;



Question: How do I do this with FSDataInputStream instead of Path? It seems 
like Path is baked in. I have a requirement to work with FSDataInputStream 
rather than Path and File.

Thank You
Best Regards

--
Daniel St. John
Senior Software Engineer, RedPoint Global Inc.
1515 Walnut Street | Suite 300 | Boulder, CO 80302-5429
C: +719 439 7825
Skype/email: [email protected]
www.redpoint.net

From: "Daniel St. John" <[email protected]>
Date: Thursday, March 12, 2015 at 9:30 PM
To: "[email protected]" <[email protected]>
Subject: UTF8 and Parquet

Hello,

Thank You for hearing me.

I am creating a translator between Parquet and our proprietary record format 
here at RedPoint. I create a Parquet file using a message definition to 
define the schema for the Parquet file, like so:

message m {
  optional int64 id;
  optional binary name (UTF8);
  optional binary address (UTF8);
  …
}


Now the UTF8 annotation is accessed through the OriginalType information from 
the type. The idea is that for the BINARY primitive type I could query the 
OriginalType information to decide whether to translate the binary to text.
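
As a minimal sketch of that idea, independent of the Parquet classes: the 
Annotation enum below is a hypothetical stand-in for OriginalType, and 
translate is an illustrative helper name, not part of any library. It only 
shows the intended decision, which is to decode annotated binaries as UTF-8 
text and pass unannotated ones through as raw bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BinaryTranslationSketch {

    // Hypothetical stand-in for the schema annotation; null means "no annotation".
    enum Annotation { UTF8 }

    // Returns a String for UTF8-annotated columns, otherwise a copy of the raw bytes.
    static Object translate(byte[] raw, Annotation annotation) {
        if (annotation == Annotation.UTF8) {
            return new String(raw, StandardCharsets.UTF_8);
        }
        return Arrays.copyOf(raw, raw.length);
    }

    public static void main(String[] args) {
        byte[] raw = "Boulder".getBytes(StandardCharsets.UTF_8);

        Object text = translate(raw, Annotation.UTF8); // annotated column
        Object bytes = translate(raw, null);           // unannotated column

        System.out.println(text instanceof String);
        System.out.println(bytes instanceof byte[]);
    }
}
```

Without the annotation there is no way to tell the two cases apart, which is 
why the translation depends on it surviving in the schema.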

However, when I open a Parquet file whose schema was originally annotated 
with UTF8 specifiers, the schema queried from the footer is missing the 
OriginalType information. I understand that at the storage level the 
annotation is meaningless, but at the object model layer it is critical for 
the proper translation to our types.

Thank You, Regards
Daniel
--
Daniel St. John
Senior Software Engineer, RedPoint Global Inc.
1515 Walnut Street | Suite 300 | Boulder, CO 80302-5429
C: +719 439 7825
Skype/email: [email protected]
www.redpoint.net
