Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi,

I'm trying to run Spark without Hadoop, where the data would be read from
and written to the local disk.

For this I have a few questions:

1. Which download do I need to use? Among the download options I don't see any
binary download that does not need Hadoop. Is the only way to do this to
download the source code and compile it myself?

2. Which installation/quick-start guide should I use for this? So far I haven't
seen any documentation that specifically addresses installing and setting up
Spark without Hadoop, unless I'm missing one.

Regards,
Sourav


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Ted Yu
Sourav:
Please see https://spark.apache.org/docs/latest/spark-standalone.html

Cheers
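
A minimal sketch of the standalone route with a pre-built 1.4.1 package (SPARK_HOME, the host name, and the smoke test below are illustrative placeholders, not commands copied from the docs):

// Start a one-node standalone cluster from the unpacked distribution:
//   $SPARK_HOME/sbin/start-master.sh
//   $SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077
//   $SPARK_HOME/bin/spark-shell --master spark://<master-host>:7077
// Inside spark-shell, sc is already created; a quick smoke test:
val nums = sc.parallelize(1 to 1000)
println(nums.sum())  // expect 500500.0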




Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread ayan guha
Hi

You really do not need a Hadoop installation. You can download a pre-built
version for any Hadoop version, unzip it, and you are good to go. Yes, it may
complain while launching the master and workers; you can safely ignore those
warnings. The only issue is when writing to a directory. Of course, you will
not be able to use any Hadoop InputFormat etc. out of the box.

** I am assuming this is a learning question :) For production, I would
suggest building it from source.

If you are using Python and need some help, please drop me a note offline.

Best
Ayan

-- 
Best Regards,
Ayan Guha
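
To make Ayan's point concrete, a minimal sketch of reading and writing purely local files with a pre-built package, assuming spark-shell was started in local mode (e.g. bin/spark-shell --master local[2]); the paths are placeholders:

val input = sc.textFile("file:///tmp/input.txt")   // plain local file, no HDFS
val upper = input.map(_.toUpperCase)
upper.saveAsTextFile("file:///tmp/output")         // target directory must not exist yet
println(upper.count())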


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Actually, Hadoop InputFormats can still be used to read from and write to
file://, s3n://, and similar schemes. You just won't be able to read from or
write to HDFS without installing Hadoop and setting up an HDFS cluster.

To summarize: Sourav, you can use any of the prebuilt packages (i.e.
anything other than source code).

Hope that helps,
-Jey
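
As a sketch of that point, a Hadoop InputFormat can be used directly against a local file:/// path without any HDFS cluster (the path below is hypothetical):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "file:///tmp/sample.txt")
println(lines.map(_._2.toString).first())   // first line of the local file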




Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey,

Not much luck.

If I use com.databricks:spark-csv_2.11:1.1.0 or
com.databricks.spark.csv_2.11.1.1.0 as the format, I get a class-not-found
error. With com.databricks.spark.csv I don't get the class-not-found error,
but I still get the previous error even after using file:/// in the URI.

Regards,
Sourav

On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Hi Sourav,

 The error seems to be caused by the fact that your URL starts with
 file:// instead of file:///.

 Also, I believe the current version of the package for Spark 1.4 with
 Scala 2.11 should be com.databricks:spark-csv_2.11:1.1.0.

 -Jey


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Hi Sourav,

The error seems to be caused by the fact that your URL starts with
file:// instead of file:///.

Also, I believe the current version of the package for Spark 1.4 with Scala
2.11 should be com.databricks:spark-csv_2.11:1.1.0.

-Jey
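
The difference between the two URI forms can be seen with plain java.net.URI (reusing the path from the thread):

val wrong = new java.net.URI("file://home/biadmin/DataScience/PlutoMN.csv")
println(wrong.getAuthority)  // "home" -- treated as a host, not part of the path
println(wrong.getPath)       // "/biadmin/DataScience/PlutoMN.csv"
val right = new java.net.URI("file:///home/biadmin/DataScience/PlutoMN.csv")
println(right.getPath)       // "/home/biadmin/DataScience/PlutoMN.csv"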


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey,

This solves the class not found problem. Thanks.

But the input format issue is still not resolved. It looks like it is still
trying to create a HadoopRDD, and I don't know why. The error message goes like this:

java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.take(RDD.scala:1246)
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.first(RDD.scala:1285)
at
com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:129)
at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:127)
at
com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:109)
at com.databricks.spark.csv.CsvRelation.init(CsvRelation.scala:62)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:115)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:28)
at $iwC$$iwC$$iwC$$iwC.init(console:30)
at $iwC$$iwC$$iwC.init(console:32)
at $iwC$$iwC.init(console:34)
at $iwC.init(console:36)
at init(console:38)
at .init(console:42)
at .clinit(console)
at java.lang.J9VMInternals.initializeImpl(Native Method)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
at .init(console:7)
at .clinit(console)
at java.lang.J9VMInternals.initializeImpl(Native Method)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
The format is still com.databricks.spark.csv, but the parameter passed to
spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0.
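
Putting the two pieces together, a sketch of the intended invocation (the header option and the local path are assumptions for illustration):

// Launch the shell with the package coordinate:
//   bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0
// Then use the short format name inside the shell:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // assuming the CSV has a header row
  .load("file:///home/biadmin/DataScience/PlutoMN.csv")
df.printSchema()
df.show(5)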


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
All InputFormats will use HadoopRDD or NewHadoopRDD. Are you using file:///
instead of file://?
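
For what it's worth, even a plain local read goes through a HadoopRDD, so seeing HadoopRDD in the stack trace does not by itself mean HDFS is required (the path below is a placeholder):

val rdd = sc.textFile("file:///tmp/example.txt")
println(rdd.toDebugString)   // the lineage shows a HadoopRDD even for a file:/// path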


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey,

Thanks for your inputs.

I'm probably getting the error because I'm trying to read a CSV file from the
local file system using the com.databricks.spark.csv package. Perhaps this
package has a hard-coded dependency on Hadoop, since it is trying to read the
input format via HadoopRDD.

Can you please confirm?

Here is what I did:

I ran the spark-shell as:

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

Then in the shell I ran:

val df = sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv")



Regards,
Sourav

15/06/29 15:14:59 INFO spark.SparkContext: Created broadcast 0 from
textFile at CsvRelation.scala:114
java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.take(RDD.scala:1246)
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.first(RDD.scala:1285)
at
com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:114)
at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:112)
at
com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:95)
at com.databricks.spark.csv.CsvRelation.init(CsvRelation.scala:53)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:89)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:39)
at
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:28)
at $iwC$$iwC$$iwC$$iwC.init(console:30)
at $iwC$$iwC$$iwC.init(console:32)
at $iwC$$iwC.init(console:34)
at $iwC.init(console:36)
at init(console:38)
at .init(console:42)
at .clinit(console)
at java.lang.J9VMInternals.initializeImpl(Native Method)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
at .init(console:7)
at .clinit(console)
at java.lang.J9VMInternals.initializeImpl(Native Method)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at