[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-25348.
-----------------------------------
    Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24354
[https://github.com/apache/spark/pull/24354]

> Data source for binary files
> ----------------------------
>
>                 Key: SPARK-25348
>                 URL: https://issues.apache.org/jira/browse/SPARK-25348
>             Project: Spark
>          Issue Type: Story
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Weichen Xu
>            Priority: Major
>             Fix For: 3.0.0
>
>
> It would be useful to have a data source implementation for binary files,
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary]. It would be
> great if we could merge it into the Spark main repo.
> cc: [~mhamilton] and [~imatiach]
>
> Proposed API:
>
> Format name: "binaryFile"
>
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
> ** path: StringType
> ** modificationTime: Timestamp
> ** length: LongType (size limit 2GB)
>
> Options:
> * pathGlobFilter: only include files with path matching the glob pattern
>
> Input partition size can be controlled by common SQL confs: maxPartitionBytes
> and openCostInBytes
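
For readers who want to see the proposed record layout concretely, here is a rough plain-Python sketch of it. It is not Spark code and not the implementation from the pull request; it only mimics the schema listed above (content plus a nested status with path, modificationTime, and length) and the pathGlobFilter option. The names read_binary_files, BinaryFileRecord, FileStatus, and MAX_LENGTH are hypothetical. Two assumptions are made: the glob is matched against the file name only, and "size limit 2GB" is read as rejecting any file larger than 2 GiB.

```python
import fnmatch
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator

# Assumption: "size limit 2GB" in the proposal is interpreted as 2 GiB.
MAX_LENGTH = 2 * 1024**3


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for the proposed nested `status` field."""
    path: str
    modification_time: datetime
    length: int


@dataclass(frozen=True)
class BinaryFileRecord:
    """Hypothetical stand-in for one row of the proposed schema."""
    content: bytes
    status: FileStatus


def read_binary_files(root: str, path_glob_filter: str = "*") -> Iterator[BinaryFileRecord]:
    """Yield one record per file under `root` whose name matches the glob."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Assumption: the glob is applied to the file name, not the full path.
            if not fnmatch.fnmatchcase(name, path_glob_filter):
                continue
            full_path = Path(dirpath) / name
            stat = full_path.stat()
            if stat.st_size > MAX_LENGTH:
                raise ValueError(f"{full_path} exceeds the 2GB size limit")
            yield BinaryFileRecord(
                content=full_path.read_bytes(),
                status=FileStatus(
                    path=str(full_path),
                    modification_time=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
                    length=stat.st_size,
                ),
            )
```

Calling read_binary_files("/some/dir", "*.png") would yield only records for PNG files, each carrying the raw bytes alongside its file metadata. In Spark itself the equivalent read would go through the "binaryFile" format name given in the issue, with partition sizing left to the SQL confs mentioned above.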