Re: Loading lots of parquet files into dataframe from s3

2015-06-18 Thread lovelylavs
You can do something like this:

// Assumes s3Client, listObjectsRequest, dayBefore, dayAfter and s3_bucket
// have already been set up earlier in the job.
ObjectListing objectListing;
List<String> FileNames = new ArrayList<String>();

// Page through the bucket listing and keep the keys whose last-modified
// time falls between dayBefore and dayAfter and that look like log files.
do {
    objectListing = s3Client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        if (objectSummary.getLastModified().compareTo(dayBefore) > 0
                && objectSummary.getLastModified().compareTo(dayAfter) <= 0
                && objectSummary.getKey().contains(".log")) {
            FileNames.add(objectSummary.getKey());
        }
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());

// Build one comma-separated string of s3n:// paths, without a trailing comma.
String concatName = "";
for (String fName : FileNames) {
    if (FileNames.indexOf(fName) == (FileNames.size() - 1)) {
        concatName += "s3n://" + s3_bucket + "/" + fName;
    } else {
        concatName += "s3n://" + s3_bucket + "/" + fName + ",";
    }
}
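
That comma-separated string can then be split back into individual paths and handed
to the DataFrameReader. A minimal sketch in Java, assuming a Spark 1.4-era
SQLContext and an existing SparkContext named sc (both of those names are
assumptions, not part of the original snippet):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Assumed setup: sc is an existing SparkContext created elsewhere in the job.
SQLContext sqlContext = new SQLContext(sc);

// DataFrameReader.parquet(...) accepts multiple paths, so split the
// comma-separated string built above back into individual s3n:// paths.
DataFrame df = sqlContext.read().parquet(concatName.split(","));
df.printSchema();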






Re: Loading lots of parquet files into dataframe from s3

2015-06-17 Thread arnonrgo
What happens is that Spark opens each of the files in order to merge their schemas.
Unfortunately, Spark assumes the files are local and therefore cheap to access,
which makes this step extremely slow on S3.

If you know all the files share the same schema (e.g. they are the output of a
previous job), you can tell Spark to skip this check by setting the option
"mergeSchema" to "false", as in
read.format("parquet").option("mergeSchema","false").load("path")
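
For reference, a minimal sketch of the same call from Java, assuming a Spark
1.4-era SQLContext and a placeholder S3 path (the bucket name and sc are
assumptions for illustration):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Assumed setup: sc is an existing SparkContext; the path below is a placeholder.
SQLContext sqlContext = new SQLContext(sc);

// Skip schema merging when all files are known to share the same schema.
DataFrame df = sqlContext.read()
        .format("parquet")
        .option("mergeSchema", "false")
        .load("s3n://my-bucket/parquet-dir/");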




