Hi,

I was exploring the possibility of using CTAS (CREATE TABLE AS SELECT) with spark-sql (Spark 1.3.1) to save big query results into CSV-formatted files for offline viewing. These are the two things I tried:
1. CREATE TABLE IF NOT EXISTS csv_dump27 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/data/offline/' AS SELECT X, Y FROM tableName WHERE timestamp=1427094000;

2. CREATE TABLE IF NOT EXISTS csv_dump44 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS PARQUET LOCATION '/data/offline/' AS SELECT X, Y FROM tableName WHERE timestamp=1427094000;

Here are my observations:

Case 1 - The table is created in Hive and I can see the CSV data at the provided path. But when I run queries over this table through spark-sql, I get exceptions (stack trace below).

Case 2 - The table is created in Hive and I can see the Parquet files. Strangely, this time I am able to query the table through spark-sql without any exception. This does not help me, though, since saving the data as Parquet defeats my purpose of offline viewing in CSV.

So my question is: is CTAS supported in spark-sql with TEXTFILE storage?

On some debugging, I saw that in the text case the "dir" (directory) variable passed to TextInputFormat contains the comma-joined paths of every part file under that directory, and StringUtils.split is not able to split it back into individual paths:

hdfs://NN-199:9000/data/offline/part-00000\,hdfs://NN-199:9000/data/offline/part-00001\,hdfs://NN-199:9000/data/offline/part-00002\, ...

In the Parquet case, however, the "dir" variable is correctly passed as just the top-level directory:

hdfs://NN-199:9000/data/offline/csv_dump44

Thanks,
Kashish Jain

P.S.: Stack trace

java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 10: part-00000,hdfs:
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.hadoop.fs.Path.<init>(Path.java:94)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1556)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:825)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:88)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:423)
    at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 10: part-00000,hdfs:
    at java.net.URI$Parser.fail(Unknown Source)
    at java.net.URI$Parser.checkChars(Unknown Source)
    at java.net.URI$Parser.parse(Unknown Source)
    at java.net.URI.<init>(Unknown Source)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
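
P.P.S.: For what it's worth, the URI error by itself is easy to reproduce outside Spark. Below is a minimal standalone sketch (the path is made up along the lines of the ones above): when the comma-joined string reaches java.net.URI without being split, everything before the first ':' is taken as the scheme, and the ',' at index 10 is rejected with the same message as in the trace.

import java.net.URI

object UriSplitRepro {
  def main(args: Array[String]): Unit = {
    // "part-00000,hdfs" is parsed as the URI scheme because the first ':'
    // comes before any '/'. The ',' at index 10 is an illegal scheme
    // character, so this throws:
    //   java.net.URISyntaxException: Illegal character in scheme name
    //   at index 10: part-00000,hdfs://...
    new URI("part-00000,hdfs://NN-199:9000/data/offline/part-00001")
  }
}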
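
In the meantime, the workaround I am considering is to skip CTAS for the text case and write the CSV myself from the DataFrame API. This is only a rough, untested sketch using the same query as above; it does not quote embedded commas or handle nulls specially:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CsvDump {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-dump"))
    val sqlContext = new HiveContext(sc)

    // Run the same SELECT as in the CTAS statement...
    val df = sqlContext.sql(
      "SELECT X, Y FROM tableName WHERE timestamp=1427094000")

    // ...and write it out as comma-separated text directly, avoiding the
    // text-backed Hive table altogether.
    df.rdd
      .map(row => row.toSeq.mkString(","))
      .saveAsTextFile("hdfs://NN-199:9000/data/offline/csv_dump")
  }
}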