Marshall created SPARK-6161:
-------------------------------
Summary: sqlCtx.parquetFile(dataFilePath) throws NPE when using s3, but OK when using local filesystem
Key: SPARK-6161
URL: https://issues.apache.org/jira/browse/SPARK-6161
Project: Spark
Issue Type: Question
Components: Spark Submit
Affects Versions: 1.2.1
Environment: MacOSX 10.10, S3
Reporter: Marshall
Using some examples from Spark Summit 2014 and Spark 1.2.1, we converted 15
pipe-separated raw text files (averaging ~100k lines each) individually
to the Parquet file format using the following code:
JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
schemaXXXXData.registerTempTable("xxxxdata");
schemaXXXXData.saveAsParquetFile(output);
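The snippet above elides how xxxxData is built; for context, here is a minimal sketch of the whole conversion step, assuming XXXXRecord is a plain JavaBean. The input/output paths, the pipe-splitting, and the elided setter calls are placeholders, not our actual field mapping:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

SparkConf conf = new SparkConf().setAppName("TextToParquet");
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

final String inputFile = "/tmp/xxxxprocessor/raw/one.txt";          // placeholder
final String output = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet"; // placeholder

// Read one raw pipe-separated file and map each line onto the bean
JavaRDD<XXXXRecord> xxxxData = ctx.textFile(inputFile).map(
    new Function<String, XXXXRecord>() {
        public XXXXRecord call(String line) {
            String[] fields = line.split("\\|");
            XXXXRecord record = new XXXXRecord();
            // record.setXxx(fields[0]); ... setters elided
            return record;
        }
    });

// Apply the bean schema and write the result out as Parquet
// (saveAsParquetFile writes a directory of part-* files)
JavaSchemaRDD schemaXXXXData = sqlCtx.applySchema(xxxxData, XXXXRecord.class);
schemaXXXXData.saveAsParquetFile(output);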
We took the results from each output folder, renamed the part file to match the
original filename plus .parquet, and dropped them all into one directory.
We then created a Java class that we invoke via a
spark-1.2.1/bin/spark-submit command...
SparkConf sparkConf = new SparkConf().setAppName("XXXXX");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
final String dataFilePath = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";
//final String dataFilePath = inputPath;
// Create a JavaSchemaRDD from the file(s) pointed to by path
JavaSchemaRDD xxxxData = sqlCtx.parquetFile(dataFilePath);
GOOD: when we run our Spark app locally (specifying dataFilePath as the full
filename of ONE specific Parquet file on the local filesystem), all is well... the
'sqlCtx.parquetFile(dataFilePath);' call finds the file and proceeds.
GOOD: when we run our Spark app locally (specifying dataFilePath as the
directory that contains all the Parquet files), all is well... the
'sqlCtx.parquetFile(dataFilePath);' call rips through each file in the
dataFilePath directory and proceeds.
GOOD: if we do the same thing by uploading ONE of the Parquet files to S3 and
changing our app to use the S3 path (giving it the full filename of ONE Parquet
file), all is good - the code finds the file and proceeds...
BAD: if we then upload all the Parquet files to S3 and specify the S3 directory
where all the Parquet files live, we get an NPE:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
    at java.io.BufferedInputStream.close(BufferedInputStream.java:472)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:428)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
    at org.apache.spark.sql.api.java.JavaSQLContext.parquetFile(JavaSQLContext.scala:141)
    at com.aol.ido.spark.sql.XXXXFileIndexParquet.doWork(XXXFileIndexParquet.java:101)
Wondering why specifying a 'dir' works locally but not in S3...
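To make the four scenarios concrete, the dataFilePath values look roughly like this (bucket and file names are placeholders, and the s3n:// scheme is an assumption on our part, inferred from the NativeS3FileSystem frame in the trace above):

// Placeholder paths; only the last one triggers the NPE
final String localFile = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet/one.parquet"; // GOOD
final String localDir  = "/tmp/xxxxprocessor/xxxxsamplefiles_parquet";             // GOOD
final String s3File    = "s3n://our-bucket/xxxxsamplefiles_parquet/one.parquet";   // GOOD
final String s3Dir     = "s3n://our-bucket/xxxxsamplefiles_parquet";               // BAD: NPE

JavaSchemaRDD xxxxData = sqlCtx.parquetFile(s3Dir);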
BTW, we have done the above steps with JSON-formatted files, and all four
scenarios work well:
// Create a JavaSchemaRDD from the file(s) pointed to by path
JavaSchemaRDD xxxxData = sqlCtx.jsonFile(dataFilePath);