[
https://issues.apache.org/jira/browse/PARQUET-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787046#comment-17787046
]
Jiashen Zhang edited comment on PARQUET-2378 at 11/17/23 6:28 AM:
------------------------------------------------------------------
What if we print the content of a given Parquet file directly? Below is a code
sample:
{code:java}
// Placeholder: path to the Parquet file to print.
String input = <parquet file>;
ParquetReader<SimpleRecord> reader = null;
try {
  PrintWriter writer = new PrintWriter(Main.out, true);
  reader = ParquetReader.builder(new SimpleReadSupport(), new Path(input)).build();
  // The footer metadata and JSON formatter are not used by the pretty-print loop below.
  ParquetMetadata metadata = ParquetFileReader.readFooter(new Configuration(), new Path(input));
  JsonRecordFormatter.JsonGroupFormatter formatter =
      JsonRecordFormatter.fromSchema(metadata.getFileMetaData().getSchema());
  // Read records one at a time until the reader is exhausted.
  for (SimpleRecord value = reader.read(); value != null; value = reader.read()) {
    value.prettyPrint(writer);
    writer.println();
  }
} finally {
  if (reader != null) {
    try {
      reader.close();
    } catch (Exception ex) {
      // Ignore failures while closing the reader.
    }
  }
}{code}
Output sample:
{code:java}
.......
id = 15012
category = open_qa
original-instruction = What is the difference between a road bike and a
mountain bike?
original-context =
original-response = Road bikes are built to be ridden on asphalt and cement
surfaces and have thin tires, whereas mountain bikes are built to be ridden on
dirt and have wider tires. Road bikes also have more aerodynamic handle bars
while mountain bike handle bars a built for less responsive steering while
bouncing around off the road.
new-instruction:
.user_id:
..list:
.value:
..list:
...item = What is the difference between a road bike and a mountain bike?
.status:
..list:
...item = submitted
new-context:
.user_id:
..list:
.value:
..list:
...item =
.status:
..list:
...item = submitted
new-response:
.user_id:
..list:
.value:
..list:
...item = Road bikes are built to be ridden on asphalt and cement surfaces and
have thin tires, whereas mountain bikes are built to be ridden on dirt and have
wider tires. Road bikes also have more aerodynamic handle bars while mountain
bike handle bars a built for less responsive steering while bouncing around off
the road.
.status:
..list:
...item = submitted
id = 15013
category = general_qa
original-instruction = How does GIS help in the real estate investment industry?
original-context =
original-response = Real estate investors depend on precise, accurate location
intelligence for competitive insights about the markets and locations where
they do business. Real estate investment teams use GIS to bring together
location-specific data, mapping, and visualization technology. This enables
them to provide the latest insights about real estate markets and their
investments, now and in the future. Using thousands of global datasets,
investors can quickly understand how their real estate investments are
performing across town or around the world, quickly access precise local data
about real estate assets, on any device, anywhere, anytime, including
information on occupancy, building maintenance, property valuation, and
more.Real estate companies and investors use GIS to research markets, identify
new opportunities for growth and expansion, and manage their investments at the
market and neighborhood levels. They can also use GIS to create professional
digital and printed materials—such as 3D renderings and virtual
walk-throughs—to help market investments across platforms. Real estate
investors can use mobile data collection tools to gather property information
directly from the field and analyze and share insights across their
organizations in real time. Investors can leverage precise local knowledge
about their assets across geographies. GIS maps and dashboards help investors
see, in real-time, relevant data that can affect properties, and streamline
investment management with access to all relevant data about every asset in any
portfolio.
new-instruction:
.user_id:
..list:
.value:
..list:
...item = How does GIS help in the real estate investment industry?
.status:
..list:
...item = submitted
new-context:
.user_id:
..list:
.value:
..list:
...item =
.status:
..list:
...item = submitted
new-response:
.user_id:
..list:
.value:
..list:
...item = Real estate investors depend on precise, accurate location
intelligence for competitive insights about the markets and locations where
they do business. Real estate investment teams use GIS to bring together
location-specific data, mapping, and visualization technology. This enables
them to provide the latest insights about real estate markets and their
investments, now and in the future. Using thousands of global datasets,
investors can quickly understand how their real estate investments are
performing across town or around the world, quickly access precise local data
about real estate assets, on any device, anywhere, anytime, including
information on occupancy, building maintenance, property valuation, and
more.Real estate companies and investors use GIS to research markets, identify
new opportunities for growth and expansion, and manage their investments at the
market and neighborhood levels. They can also use GIS to create professional
digital and printed materials—such as 3D renderings and virtual
walk-throughs—to help market investments across platforms. Real estate
investors can use mobile data collection tools to gather property information
directly from the field and analyze and share insights across their
organizations in real time. Investors can leverage precise local knowledge
about their assets across geographies. GIS maps and dashboards help investors
see, in real-time, relevant data that can affect properties, and streamline
investment management with access to all relevant data about every asset in any
portfolio.
.status:
..list:
...item = submitted
id = 15014
category = general_qa
original-instruction = What is the Masters?
original-context =
original-response = The Masters Tournament is a golf tournament held annually
in the first week of April at Augusta National Golf Club in Augusta, Georgia.
The Masters is one of four Major golf tournaments and the only one to be played
at the same course every year. The course is renowned for its iconic holes,
impeccable groundskeeping, and colorful flowers that are typically in bloom.
The winner earns a coveted Green Jacket and a lifetime invitation back to
compete. Many players and fans consider The Masters to be their favorite
tournament given these traditions and the historical moments that have occurred
in past tournaments.
new-instruction:
.user_id:
..list:
.value:
..list:
...item = What is the Masters?
.status:
..list:
...item = submitted
new-context:
.user_id:
..list:
.value:
..list:
...item =
.status:
..list:
...item = submitted
new-response:
.user_id:
..list:
.value:
..list:
...item = The Masters Tournament is a golf tournament held annually in the
first week of April at Augusta National Golf Club in Augusta, Georgia. The
Masters is one of four Major golf tournaments and the only one to be played at
the same course every year. The course is renowned for its iconic holes,
impeccable groundskeeping, and colorful flowers that are typically in bloom.
The winner earns a coveted Green Jacket and a lifetime invitation back to
compete. Many players and fans consider The Masters to be their favorite
tournament given these traditions and the historical moments that have occurred
in past tournaments.
.status:
..list:
...item = submitted{code}
> Problem with a cat
> ------------------
>
> Key: PARQUET-2378
> URL: https://issues.apache.org/jira/browse/PARQUET-2378
> Project: Parquet
> Issue Type: Bug
> Reporter: Rémy Léone
> Priority: Major
> Attachments: image-2023-11-16-21-40-07-628.png
>
>
> *$* parquet cat train-00000-of-00001-15a05aeec7726f9d.parquet
>
> Unknown error
> shaded.parquet.org.apache.avro.SchemaParseException: Illegal character in: original-instruction
> at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1607)
> at shaded.parquet.org.apache.avro.Schema.access$400(Schema.java:92)
> at shaded.parquet.org.apache.avro.Schema$Field.<init>(Schema.java:556)
> at shaded.parquet.org.apache.avro.Schema$Field.<init>(Schema.java:595)
> at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:295)
> at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:279)
> at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89)
> at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405)
> at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
> at org.apache.parquet.cli.Main.run(Main.java:163)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.parquet.cli.Main.main(Main.java:193)
> The dataset in question is:
> [https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en/tree/main/data]
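The stack trace above fails in Avro's name validation while the Parquet schema is being converted to an Avro schema. As a rough illustration of why a column such as original-instruction is rejected, here is a minimal sketch of the naming rule as it is written in the Avro specification (names start with a letter or underscore and continue with letters, digits, or underscores). The class AvroNameCheck and the method isValidAvroName are hypothetical names used only for this example; this is not the Avro library's own code, whose actual check may differ in detail.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Avro or Parquet.
class AvroNameCheck {
    // Naming rule as stated in the Avro specification:
    // first character [A-Za-z_], remaining characters [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns true if the name satisfies the rule above.
    static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Column names taken from the output sample in the comment.
        String[] columns = {"id", "category", "original-instruction", "new-context"};
        for (String column : columns) {
            String verdict = isValidAvroName(column) ? "valid" : "rejected";
            System.out.println(column + " -> " + verdict);
        }
    }
}
```

Under this rule the hyphenated columns are the ones that fail, which matches the "Illegal character in: original-instruction" message. Printing records without going through an Avro schema, as the comment proposes, would avoid this check.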