[ https://issues.apache.org/jira/browse/PARQUET-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788344#comment-17788344 ]
Gang Wu commented on PARQUET-2378:
----------------------------------
Sorry for the late reply. I'm not sure it is a good idea to add a new
command. Since both the head and cat commands have the same issue, can we
catch the exception and use the JsonRecordFormatter you proposed as a
fallback when the Avro schema conversion fails?
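
A minimal sketch of that fallback pattern, for illustration only. The class
and method names below are hypothetical stand-ins and not the parquet-cli API.
IllegalArgumentException stands in for the Avro SchemaParseException shown in
the stack trace, and the name check assumes the Avro naming rule
[A-Za-z_][A-Za-z0-9_]*, which rejects the hyphen in "original-instruction".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FallbackSketch {
    // Avro names must match this pattern, so "original-instruction" is rejected.
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Hypothetical stand-in for the Avro-based path: fails on an illegal field name.
    static String renderViaAvro(Map<String, String> record) {
        for (String name : record.keySet()) {
            if (!AVRO_NAME.matcher(name).matches()) {
                throw new IllegalArgumentException("Illegal character in: " + name);
            }
        }
        return "avro:" + record;
    }

    // Hypothetical stand-in for the fallback formatter: plain JSON, no schema conversion.
    static String renderViaJson(Map<String, String> record) {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, String> e : record.entrySet()) {
            sb.append(sep).append(quote(e.getKey())).append(": ").append(quote(e.getValue()));
            sep = ", ";
        }
        return sb.append("}").toString();
    }

    // Escapes only backslashes and double quotes; enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Try the schema-based path first; fall back to JSON if the names are not convertible.
    static String render(Map<String, String> record) {
        try {
            return renderViaAvro(record);
        } catch (IllegalArgumentException e) {
            return renderViaJson(record);
        }
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("original-instruction", "hi");
        System.out.println(render(record)); // prints {"original-instruction": "hi"}
    }
}
```

In the real commands the try block would wrap the schema conversion that both
head and cat go through, so one catch site could cover both of them.
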
> Problem with a cat
> ------------------
>
> Key: PARQUET-2378
> URL: https://issues.apache.org/jira/browse/PARQUET-2378
> Project: Parquet
> Issue Type: Bug
> Reporter: Rémy Léone
> Priority: Major
> Attachments: image-2023-11-16-21-40-07-628.png
>
>
> $ parquet cat train-00000-of-00001-15a05aeec7726f9d.parquet
>
> Unknown error
> shaded.parquet.org.apache.avro.SchemaParseException: Illegal character in: original-instruction
> at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1607)
> at shaded.parquet.org.apache.avro.Schema.access$400(Schema.java:92)
> at shaded.parquet.org.apache.avro.Schema$Field.<init>(Schema.java:556)
> at shaded.parquet.org.apache.avro.Schema$Field.<init>(Schema.java:595)
> at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:295)
> at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:279)
> at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89)
> at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405)
> at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
> at org.apache.parquet.cli.Main.run(Main.java:163)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.parquet.cli.Main.main(Main.java:193)
>
> The data set in question is:
> [https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en/tree/main/data]