Marshall created SPARK-6161:
-------------------------------
Summary: sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem
Key: SPARK-6161
URL: https://issues.apache.org/jira/browse/SPARK-6161
Project: Spark
Issue Type: Question
Components: Spark Submit
Affects Versions: 1.2.1
Environment: MacOSX 10.10, S3
Reporter: Marshall
Using some examples from Spark Summit 2014 and Spark 1.2.1, we converted 15
pipe-separated raw text files (averaging ~100k lines each) individually
to the Parquet file format using the following code:
JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
schemaXXXXData.registerTempTable("xxxxdata");
schemaXXXXData.saveAsParquetFile(output);
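The snippet above elides how xxxxData is built; for context, here is a minimal sketch of the whole conversion step, assuming XXXXRecord is a plain JavaBean. The input/output paths, the pipe-splitting, and the elided setter calls are placeholders, not our actual field mapping:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

SparkConf conf = new SparkConf().setAppName("TextToParquet");
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

final String inputFile = "/tmp/xxxxprocessor/raw/one.txt";          // placeholder
final String output = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet"; // placeholder

// Read one raw pipe-separated file and map each line onto the bean
JavaRDD<XXXXRecord> xxxxData = ctx.textFile(inputFile).map(
    new Function<String, XXXXRecord>() {
        public XXXXRecord call(String line) {
            String[] fields = line.split("\\|");
            XXXXRecord record = new XXXXRecord();
            // record.setXxx(fields[0]); ... setters elided
            return record;
        }
    });

// Apply the bean schema and write the result out as Parquet
// (saveAsParquetFile writes a directory of part-* files)
JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
schemaXXXXData.saveAsParquetFile(output);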
We took the results from each output folder, renamed the part file to match the
original filename plus .parquet, and dropped them all into one directory.
We then created a Java class that we invoke via a
spark-1.2.1/bin/spark-submit command...
SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
final String dataFilePath = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
//final String dataFilePath = inputPath;
// Create a JavaSchemaRDD from the file(s) pointed to by path
JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
GOOD: when we run our Spark app locally (specifying dataFilePath as the full
filename of ONE specific Parquet file on the local filesystem), all is well... the
'sqlCtx.parquetFile(dataFilePath);' call finds the file and proceeds.
GOOD: when we run our Spark app locally (specifying dataFilePath as the
directory that contains all the Parquet files), all is well... the
'sqlCtx.parquetFile(dataFilePath);' call rips through each file in the
dataFilePath directory and proceeds.
GOOD: if we do the same thing by uploading ONE of the Parquet files to S3 and
changing our app to use the S3 path (giving it the full filename of ONE Parquet
file), all is good - the code finds the file and proceeds...
BAD: if we then upload all the Parquet files to S3 and specify the S3 directory
where all the Parquet files live, we get an NPE:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
    at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
    at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
    at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
Wondering why specifying a 'dir' works locally but not in S3...
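To make the four scenarios concrete, the dataFilePath values look roughly like this (bucket and file names are placeholders, and the s3n:// scheme is an assumption on our part, inferred from the NativeS3FileSystem frame in the trace above):

// Placeholder paths; only the last one triggers the NPE
final String localFile = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet/one.parquet"; // GOOD
final String localDir  = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";             // GOOD
final String s3File    = "s3n://our-bucket/xxxxsamplefiles_parquet/one.parquet";   // GOOD
final String s3Dir     = "s3n://our-bucket/xxxxsamplefiles_parquet";               // BAD: NPE

JavaSchemaRDD xxxxData = sqlCtx.parquetFile(s3Dir);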
BTW, we have done the above steps with JSON-formatted files, and all four
scenarios work well:
// Create a JavaSchemaRDD from the file(s) pointed to by path
JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);