Hello,
I updated to the latest versions of everything in the Parquet ecosystem, and the
annotations in the message schema now come through when I read the Parquet
file, so please disregard my last message.
Question: Can I open a Parquet file with an instance of FSDataInputStream
instead of a Path?
What I have done was inspired by the CSV to Parquet example on GitHub. We are
using Parquet as storage for our proprietary record format. We are also
reading existing Parquet files and translating them to our proprietary record
format. In short, I open a Parquet file with a File or Path and query the
footer for the message type. Using the extra annotation info, I derive our
Record schema, and then read the data from Parquet into our record format one
record at a time. It is working well: records go in and come out, and the data
checks out.
Here is a summary of what I am doing:
1> The message schema:
Path parquetFilePath = ..
ParquetMetadata readFooter = null;
readFooter = ParquetFileReader.readFooter(configuration, parquetFilePath);
MessageType schema = readFooter.getFileMetaData().getSchema();
2> Then a reader:
Path path = ..
GroupReadSupport readSupport = new GroupReadSupport();
// "schema" is the MessageType read from the footer in step 1
readSupport.init(configuration, null, schema);
ParquetReader<Group> reader;
try {
    reader = new ParquetReader<Group>(path, readSupport);
} catch (IOException e) {
    // passing the exception as the second argument logs the stack trace too
    LOG.error("Cannot create Parquet reader", e);
    throw new ReadParquestFileException(e);
}
3> Get data sequentially:
Group group;
// my record
Record dmRecord =..
…
// is there another group?
if ((group = reader.read()) != null) {
    for (int index = 0; index < MY_RECORD_LENGTH; index++) {
        // stuff with data
        GroupType groupType = group.getType();
        String fieldName = groupType.getFieldName(index);
        Type type = groupType.getType(index);
        if (type.isPrimitive()) {
            PrimitiveType pt = (PrimitiveType) type;
            PrimitiveTypeName ptn = pt.getPrimitiveTypeName();
            String method = ptn.getMethod;
            String primitiveName = pt.getName();
            OriginalType originalType = type.getOriginalType();
            switch (method) {
                case "getBoolean":
                    Boolean valueBoolean = group.getBoolean(index, 0);
                    dmRecord.set(index, valueBoolean);
                    break;
                case "getFloat":
                    Float valueFloat = group.getFloat(index, 0);
                    // store the float like the other primitive cases
                    dmRecord.set(index, valueFloat);
                    break;
                case "getDouble":
                    Double valueDouble = group.getDouble(index, 0);
                    // doubles are stored in their string form
                    String valueToString = group.getValueToString(index, 0);
                    dmRecord.set(index, valueToString);
                    break;
                case "getLong":
                    Long valueLong = group.getLong(index, 0);
                    dmRecord.set(index, valueLong);
                    break;
                case "getBinary":
                    Binary valueBinary = group.getBinary(index, 0);
                    LOG.info("value(Binary):" + valueBinary.toString());
                    if (originalType == OriginalType.UTF8) {
                        // annotated as UTF8: decode the bytes as text
                        LOG.info("We have a String");
                        byte[] bytes = valueBinary.getBytes();
                        String valueToStringUTF8 = new String(bytes, "UTF-8");
                        dmRecord.set(index, valueToStringUTF8);
                    } else {
                        // no annotation: keep the raw bytes
                        dmRecord.set(index, valueBinary.getBytes());
                    }
                    break;
                default:
                    // anything else falls back to its string form
                    valueToString = group.getValueToString(index, 0);
                    dmRecord.set(index, valueToString);
                    break;
            }
        }
    }
}
return dmRecord;
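As an aside, the dispatch in step 3 switches on method-name strings. PrimitiveTypeName is itself an enum, so the same dispatch could switch on the enum constant and let the compiler check the case labels. Below is a minimal, self-contained sketch of that shape. The `Primitive` enum and the `FieldConversion` class are hypothetical stand-ins so the example runs without the Parquet jars; they are not Parquet classes, and the raw values are passed in as plain Java objects rather than read from a Group.

```java
/** Hypothetical stand-in for Parquet's PrimitiveTypeName, reduced to five cases. */
enum Primitive { BOOLEAN, FLOAT, DOUBLE, INT64, BINARY }

/** Hypothetical helper: maps one raw field value to the value stored in the record. */
final class FieldConversion {
    private FieldConversion() {}

    static Object toRecordValue(Primitive type, Object raw) {
        switch (type) {
            case BOOLEAN:
                return (Boolean) raw;
            case FLOAT:
                return (Float) raw;
            case DOUBLE:
                // mirrors step 3, where doubles are stored in their string form
                return String.valueOf(raw);
            case INT64:
                return (Long) raw;
            case BINARY:
                // copy so the record does not share the caller's array
                return ((byte[]) raw).clone();
            default:
                // any other type falls back to its string form
                return String.valueOf(raw);
        }
    }

    public static void main(String[] args) {
        System.out.println(toRecordValue(Primitive.INT64, 42L));
        System.out.println(toRecordValue(Primitive.DOUBLE, 1.5d));
    }
}
```

A misspelled case label here is a compile error, whereas a misspelled string label silently falls through to the default branch.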
Question: How do I do this with an FSDataInputStream instead of a Path? It
seems like Path is baked in. I have a requirement to work with
FSDataInputStream rather than Path and File.
Thank You
Best Regards
--
Daniel St. John
Senior Software Engineer, RedPoint Global Inc.
1515 Walnut Street | Suite 300 | Boulder, CO 80302-5429
C: +719 439 7825
Skype/email: [email protected]
www.redpoint.net<http://www.redpoint.net/>
From: "Daniel St. John"
<[email protected]<mailto:[email protected]>>
Date: Thursday, March 12, 2015 at 9:30 PM
To: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: UTF8 and Parquet
Hello,
Thank You for hearing me.
I am creating a translator between Parquet and our proprietary record format
here at RedPoint. I create a Parquet file using a message definition to
define the schema for the Parquet file, like so:
message m {
    optional int64 id;
    optional binary name (UTF8);
    optional binary address (UTF8);
    …
}
Now the UTF8 annotation is accessed through the OriginalType information from
the type. The idea is that for a BINARY primitive type I could query the
OriginalType information to decide whether to translate the binary to text.
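A minimal sketch of that idea, independent of the Parquet classes: the `Annotation` enum below is a hypothetical stand-in for OriginalType, and `BinaryDecoding.decode` is an illustrative helper name, not a Parquet API. The same bytes become a String when the column carries the UTF8 annotation and stay a byte array otherwise.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for Parquet's OriginalType; only the cases needed here. */
enum Annotation { NONE, UTF8 }

/** Hypothetical helper showing the annotation-driven translation of binary values. */
final class BinaryDecoding {
    private BinaryDecoding() {}

    /**
     * Returns a String when the column is annotated as UTF8,
     * otherwise a copy of the raw bytes.
     */
    static Object decode(byte[] raw, Annotation annotation) {
        if (annotation == Annotation.UTF8) {
            // StandardCharsets avoids the checked exception of the charset-name overload
            return new String(raw, StandardCharsets.UTF_8);
        }
        return Arrays.copyOf(raw, raw.length);
    }

    public static void main(String[] args) {
        byte[] raw = "Boulder".getBytes(StandardCharsets.UTF_8);
        Object text = decode(raw, Annotation.UTF8);
        Object bytes = decode(raw, Annotation.NONE);
        System.out.println(text instanceof String);
        System.out.println(bytes instanceof byte[]);
    }
}
```

Without the annotation there is no way to tell text from arbitrary bytes, which is why losing it matters at the object model layer.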
However, when I open a Parquet file whose schema was originally annotated
with UTF8 specifiers, the schema queried from the footer is missing the
OriginalType information. I understand that at the storage level the
annotation is meaningless, but at the object model layer it is critical for
the proper translation to our types.
Thank You, Regards
Daniel