[
https://issues.apache.org/jira/browse/ORC-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087660#comment-16087660
]
ASF GitHub Bot commented on ORC-199:
------------------------------------
Github user spasam commented on a diff in the pull request:
https://github.com/apache/orc/pull/131#discussion_r127508667
--- Diff: java/tools/src/java/org/apache/orc/tools/convert/ConvertTool.java ---
@@ -18,53 +18,178 @@
package org.apache.orc.tools.convert;
import org.apache.commons.cli.CommandLine;
-import org.apache.commons.cli.GnuParser;
+import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
+import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;
import org.apache.orc.tools.json.JsonSchemaFinder;
import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.zip.GZIPInputStream;
/**
- * A conversion tool to convert JSON files into ORC files.
+ * A conversion tool to convert CSV or JSON files into ORC files.
*/
public class ConvertTool {
+ private final List<FileInformation> fileList;
+ private final TypeDescription schema;
+ private final char csvSeparator;
+ private final char csvQuote;
+ private final char csvEscape;
+ private final int csvHeaderLines;
+ private final String csvNullString;
+ private final Writer writer;
+ private final VectorizedRowBatch batch;
- static TypeDescription computeSchema(String[] filename) throws IOException {
+ TypeDescription buildSchema(List<FileInformation> files,
+ Configuration conf) throws IOException {
JsonSchemaFinder schemaFinder = new JsonSchemaFinder();
- for(String file: filename) {
- System.err.println("Scanning " + file + " for schema");
- schemaFinder.addFile(file);
+ for(FileInformation file: files) {
+ if (file.format == Format.JSON) {
+ System.err.println("Scanning " + file.path + " for schema");
+ schemaFinder.addFile(file.getReader(file.filesystem.open(file.path)));
+ } else if (file.format == Format.ORC) {
+ System.err.println("Merging schema from " + file.path);
+ Reader reader = OrcFile.createReader(file.path,
+ OrcFile.readerOptions(conf)
+ .filesystem(file.filesystem));
+ schemaFinder.addSchema(reader.getSchema());
+ }
}
return schemaFinder.getSchema();
--- End diff --
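This throws an NPE when the only command-line argument is a CSV file.
buildSchema only feeds JSON and ORC inputs to the schema finder, so with
CSV-only input getSchema() apparently has nothing to work with: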
```
Exception in thread "main" java.lang.NullPointerException
    at org.apache.orc.tools.json.JsonSchemaFinder.getSchema(JsonSchemaFinder.java:321)
    at org.apache.orc.tools.convert.ConvertTool.buildSchema(ConvertTool.java:75)
```
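One possible way to avoid this would be to check up front whether any input can
contribute a schema, and to fail with a readable message otherwise. The sketch
below only illustrates that idea; it is not the code in the pull request. It
uses stand-in types instead of the real ORC classes: InputFile, canInferSchema
and requireSchemaSource are hypothetical names, and the explicitly supplied
schema is assumed to be a plain string.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaGuardSketch {
  // Mirrors the formats named in the diff; CSV contributes no schema there.
  enum Format { CSV, JSON, ORC }

  /** Hypothetical stand-in for the patch's FileInformation. */
  static final class InputFile {
    final String path;
    final Format format;

    InputFile(String path, Format format) {
      this.path = path;
      this.format = format;
    }
  }

  /** True if at least one input is a format the schema finder reads. */
  static boolean canInferSchema(List<InputFile> files) {
    for (InputFile file : files) {
      if (file.format == Format.JSON || file.format == Format.ORC) {
        return true;
      }
    }
    return false;
  }

  /** Fails early instead of letting a later lookup hit a null schema. */
  static void requireSchemaSource(List<InputFile> files, String explicitSchema) {
    if (explicitSchema == null && !canInferSchema(files)) {
      throw new IllegalArgumentException(
          "No schema was given and none can be inferred: "
          + "CSV-only input needs an explicit schema");
    }
  }

  public static void main(String[] args) {
    List<InputFile> csvOnly = Arrays.asList(new InputFile("data.csv", Format.CSV));
    try {
      requireSchemaSource(csvOnly, null);
    } catch (IllegalArgumentException e) {
      // Report the problem to the user rather than a bare stack trace.
      System.err.println(e.getMessage());
    }
  }
}
```

With a check like this, the CSV-only case would produce an error message
instead of a stack trace, while JSON or ORC inputs would still go through
schema discovery as before.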
> Include a CSV to ORC converter
> ------------------------------
>
> Key: ORC-199
> URL: https://issues.apache.org/jira/browse/ORC-199
> Project: ORC
> Issue Type: New Feature
> Reporter: Carter Shanklin
> Assignee: Owen O'Malley
>
> It would be good to have a utility that converts CSV to ORC without requiring
> any complex setup.
> To get things started I've created
> https://github.com/cartershanklin/csv-to-orc which uses ORC core and OpenCSV
> (which is Apache licensed).
> If there's interest it might be better to fold this into the ORC project to
> make it easier for users to find.