Running Spark 1.4.1 without Hadoop
Hi, I'm trying to run Spark without Hadoop, where the data would be read from and written to local disk. For this I have a few questions:

1. Which download do I need to use? Among the download options I don't see any binary download that does not need Hadoop. Is the only way to do this to download the source code version and compile it?

2. Which installation/quick-start guide should I use for this? So far I haven't seen any documentation that specifically addresses a Spark-without-Hadoop installation/setup, unless I'm missing one.

Regards, Sourav
Re: Running Spark 1.4.1 without Hadoop
Sourav:

Please see https://spark.apache.org/docs/latest/spark-standalone.html

Cheers
Re: Running Spark 1.4.1 without Hadoop
Hi, you really do not need a Hadoop installation. You can download a pre-built version with any Hadoop profile, unzip it, and you are good to go. Yes, it may complain while launching the master and workers; safely ignore those warnings. The only problem is while writing to a directory. Of course, you will not be able to use any Hadoop InputFormat etc. out of the box.

** I am assuming this is a learning question :) For production, I would suggest building it from source.

If you are using Python and need some help, please drop me a note offline.

Best, Ayan

--
Best Regards, Ayan Guha
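As a minimal sketch of the above (assuming a pre-built 1.4.1 package has been unzipped and spark-shell started in local mode with ./bin/spark-shell --master "local[*]"; the /tmp paths below are only placeholders, not anything from this thread):

    // read a plain text file from local disk, do a trivial word count, write the result back locally
    val lines  = sc.textFile("file:///tmp/input.txt")
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("file:///tmp/wordcount-output")   // output directory must not already exist

No HDFS, YARN, or Hadoop cluster configuration is involved; everything stays on the local filesystem.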
Re: Running Spark 1.4.1 without Hadoop
Actually, Hadoop InputFormats can still be used to read and write from file://, s3n://, and similar schemes. You just won't be able to read/write to HDFS without installing Hadoop and setting up an HDFS cluster.

To summarize: Sourav, you can use any of the prebuilt packages (i.e. anything other than source code).

Hope that helps,
-Jey
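To make that concrete, here is a rough sketch of driving a Hadoop InputFormat explicitly against the local filesystem from a prebuilt package (the path is only a placeholder; sc.textFile is essentially a convenience wrapper around the same call):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // "Hadoop" here means only the client-side InputFormat classes bundled with Spark,
    // not a running HDFS cluster; file:// URIs resolve to the local filesystem
    val raw   = sc.hadoopFile[LongWritable, Text, TextInputFormat]("file:///tmp/input.txt")
    val lines = raw.map { case (_, text) => text.toString }
    lines.take(5).foreach(println)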
Re: Running Spark 1.4.1 without Hadoop
Hi Jey, Not much luck. If I use com.databricks:spark-csv_2.11:1.1.0 or com.databricks.spark.csv_2.11.1.1.0 as the format, I get a class-not-found error. With com.databricks.spark.csv I don't get the class-not-found error, but I still get the previous error, even after using file:/// in the URI.

Regards, Sourav
Re: Running Spark 1.4.1 without Hadoop
Hi Sourav, The error seems to be caused by the fact that your URL starts with file:// instead of file:///. Also, I believe the current version of the package for Spark 1.4 with Scala 2.11 should be com.databricks:spark-csv_2.11:1.1.0.

-Jey
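For reference, the three-slash form matters because in a file: URI the two slashes introduce an authority (host) component, so a local absolute path needs an empty authority, i.e. file:/// plus the path. Putting both suggestions together, a sketch using the CSV path from this thread (and noting that the _2.11 artifact suffix should match the Scala version of the Spark build; the default prebuilt 1.4.1 binaries use Scala 2.10, for which the _2.10 artifact would apply):

    // spark-shell launched as: bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0
    val df = sqlContext.read.format("com.databricks.spark.csv").load("file:///home/biadmin/DataScience/PlutoMN.csv")
    df.printSchema()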
Re: Running Spark 1.4.1 without Hadoop
Hi Jey, This solves the class-not-found problem. Thanks. But the input format issue is still not resolved; it looks like it is still trying to create a HadoopRDD, and I don't know why. The error message goes like:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1246)
    at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1285)
    at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:129)
    at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:127)
    at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:109)
    at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:62)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:115)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
    at $iwC$$iwC$$iwC.<init>(<console>:32)
    at $iwC$$iwC.<init>(<console>:34)
    at $iwC.<init>(<console>:36)
    at <init>(<console>:38)
    at .<init>(<console>:42)
    at .<clinit>(<console>)
    at java.lang.J9VMInternals.initializeImpl(Native Method)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at java.lang.J9VMInternals.initializeImpl(Native Method)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:611)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
Re: Running Spark 1.4.1 without Hadoop
The format is still com.databricks.spark.csv, but the parameter passed to spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0.
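In other words, the Maven coordinate and the data source name are two different strings; passing the coordinate to .format(...) is what produces the class-not-found error. A short sketch of which string goes where, reusing the coordinate and path already mentioned in this thread:

    // command line: --packages takes the Maven coordinate groupId:artifactId:version
    //   bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0
    // inside the shell: .format(...) takes the data source name, which does not change with the version
    // sqlContext.read.format("com.databricks:spark-csv_2.11:1.1.0")   // wrong: coordinate used as a format name
    val df = sqlContext.read.format("com.databricks.spark.csv").load("file:///home/biadmin/DataScience/PlutoMN.csv")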
Re: Running Spark 1.4.1 without Hadoop
All InputFormats will use HadoopRDD or NewHadoopRDD. Are you using file:/// instead of file://?
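One way to narrow this down: the stack trace shows the failure happens inside a plain textFile call made by CsvRelation (which is why a HadoopRDD appears), so the same behaviour should be reproducible without spark-csv at all. A small probe, using the path from earlier in the thread:

    // exercises the same HadoopRDD/TextInputFormat code path, but without the csv package;
    // if this also fails, the problem is on the URI/Hadoop-configuration side rather than in spark-csv
    val probe = sc.textFile("file:///home/biadmin/DataScience/PlutoMN.csv")
    probe.take(1).foreach(println)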
Re: Running Spark 1.4.1 without Hadoop
Hi Jey, Thanks for your inputs. I'm probably getting the error because I'm trying to read a CSV file from the local filesystem using the com.databricks.spark.csv package. Perhaps this package has a hard-coded dependency on Hadoop, since it is trying to get the input format from a HadoopRDD. Can you please confirm?

Here is what I did. I ran the spark-shell as:

    bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

Then in the shell I ran:

    val df = sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv")

Regards, Sourav

15/06/29 15:14:59 INFO spark.SparkContext: Created broadcast 0 from textFile at CsvRelation.scala:114
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1246)
    at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1285)
    at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:114)
    at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:112)
    at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:95)
    at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:53)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:89)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:39)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:27)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
    at $iwC$$iwC$$iwC.<init>(<console>:32)
    at $iwC$$iwC.<init>(<console>:34)
    at $iwC.<init>(<console>:36)
    at <init>(<console>:38)
    at .<init>(<console>:42)
    at .<clinit>(<console>)
    at java.lang.J9VMInternals.initializeImpl(Native Method)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at java.lang.J9VMInternals.initializeImpl(Native Method)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:611)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)