Re: A basic question

2019-06-17 Thread Shyam P
Thank you so much, Deepak.
Let me implement it and update you. I hope it works.

Are there any shortcomings I need to consider or take care of?

Regards,
Shyam

On Mon, Jun 17, 2019 at 12:39 PM Deepak Sharma 
wrote:

> You can follow this example:
>
> https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
>
>
> On Mon, Jun 17, 2019 at 12:27 PM Shyam P  wrote:
>
>> I am developing a Spark job using Java 1.8.
>>
>> Is it possible to write a Spark app using Spring Boot?
>> Has anyone tried it? If so, how should it be done?
>>
>>
>> Regards,
>> Shyam
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: A basic question

2019-06-17 Thread Deepak Sharma
You can follow this example:
https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
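The linked guide covers the Spring for Apache Hadoop integration. As a rough
sketch of the general idea - a Spark job driven from a Spring Boot entry
point, not taken from that guide - something like the following could work
(shown in Scala; the class and app names are illustrative, and it assumes
spring-boot-starter and spark-sql are on the classpath and that the packaged
jar is launched with spark-submit):

import org.apache.spark.sql.SparkSession
import org.springframework.boot.{CommandLineRunner, SpringApplication}
import org.springframework.boot.autoconfigure.SpringBootApplication

// Spring Boot invokes run() once the application context is up; the Spark
// job itself is ordinary Spark code inside that callback.
@SpringBootApplication
class SparkJobApplication extends CommandLineRunner {
  override def run(args: String*): Unit = {
    val spark = SparkSession.builder().appName("spring-boot-spark-demo").getOrCreate()
    try {
      spark.range(0, 10).show() // placeholder for the real job logic
    } finally {
      spark.stop()
    }
  }
}

object SparkJobApplication {
  def main(args: Array[String]): Unit =
    SpringApplication.run(classOf[SparkJobApplication], args: _*)
}

A commonly reported pain point with this combination is dependency conflicts
between Spring Boot's repackaged fat jar and the jars spark-submit puts on the
classpath, so marking the Spark dependencies as "provided" (or shading) is
worth checking early.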


On Mon, Jun 17, 2019 at 12:27 PM Shyam P  wrote:

> I am developing a Spark job using Java 1.8.
>
> Is it possible to write a Spark app using Spring Boot?
> Has anyone tried it? If so, how should it be done?
>
>
> Regards,
> Shyam
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Fw: Basic question on using one's own classes in the Scala app

2016-06-06 Thread Marco Mistroni
Hi Ashok,
  this is not really a Spark-related question, so I would not use this
mailing list.
Anyway, my 2 cents here:
as outlined by earlier replies, if the class you are referencing is in a
different jar, at compile time you will need to add that dependency to your
build.sbt.
I'd personally leave $CLASSPATH alone...

AT RUN TIME, you have two options:
1 - as suggested by Ted, when you launch your app via spark-submit you
can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar.
2 - use the sbt-assembly plugin to package your classes and jars into a 'fat
jar', and then at runtime all you need to do is

 spark-submit   --class   

I'd personally go for 1 as it is the easiest option. (FYI, for 2 you
might encounter situations where you have dependencies referring to the same
classes, and that will require you to define an assemblyMergeStrategy.)
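For reference, a build.sbt sketch for option 2 might look like the following
(it assumes the sbt-assembly plugin is enabled in project/plugins.sbt; the
names, versions and merge rules are illustrative, not a recommendation):

name := "myapp"

scalaVersion := "2.10.6"

// The cluster provides Spark at runtime, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"

// Decide what to do when several dependencies ship a file under the same path
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

You would then launch it with something like
 spark-submit --class com.example.MyApp myapp-assembly-0.1.jar
whereas option 1 keeps the application jar thin and just adds
'--jars utilities-assembly-0.1-SNAPSHOT.jar' to the spark-submit invocation.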

hth




On Mon, Jun 6, 2016 at 8:52 AM, Ashok Kumar 
wrote:

> Anyone can help me with this please
>
>
> On Sunday, 5 June 2016, 11:06, Ashok Kumar  wrote:
>
>
> Hi all,
>
> Appreciate any advice on this. It is about scala
>
> I have created a very basic Utilities.scala that contains a test class and
> method. I intend to add my own classes and methods as I expand and make
> references to these classes and methods in my other apps
>
> class getCheckpointDirectory {
>   def CheckpointDirectory (ProgramName: String) : String  = {
>  var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
>  return hdfsDir
>   }
> }
> I have used sbt to create a jar file for it. It is created as a jar file
>
> utilities-assembly-0.1-SNAPSHOT.jar
>
> Now I want to make a call to that method CheckpointDirectory in my app
> code myapp.scala to return the value for hdfsDir
>
>val ProgramName = this.getClass.getSimpleName.trim
>val getCheckpointDirectory =  new getCheckpointDirectory
>val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)
>
> However, I am getting a compilation error as expected
>
> not found: type getCheckpointDirectory
> [error] val getCheckpointDirectory =  new getCheckpointDirectory
> [error]   ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
>
> So a basic question: in order for compilation to work, do I need to create
> a package for my jar file, or add a dependency like the following, as I do in sbt:
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>
>
> Or add the jar file to $CLASSPATH?
>
> Any advice will be appreciated.
>
> Thanks
>
>
>
>
>
>
>


RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
I guess I should also point out that I do an

export CLASSPATH

in my  .bash_profile file, so the CLASSPATH info should be usable by the 
PySpark shell that I invoke.

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Taylor, Ronald C
Sent: Monday, February 29, 2016 2:57 PM
To: 'Yin Yang'; user@spark.apache.org
Cc: Jules Damji; ronald.taylo...@gmail.com; Taylor, Ronald C
Subject: RE: a basic question on first use of PySpark shell and example, which 
is failing

HI Yin,

My Classpath is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core.jar in the ../jars subdirectory, though it is 
not named precisely “spark-core.jar”. It has a version number in its name, as 
you can see:

[rtaylor@bigdatann ~]$ find 
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > 
/people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope 
jar_file_listing_of_spark-core_jar.txt

org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$


It looks like the RDDOperationScope class is present. Shouldn’t that work?

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: 
ronald.tay...@pnnl.gov<mailto:ronald.tay...@pnnl.gov>
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Yin Yang [mailto:yy201...@gmail.com]
Sent: Monday, February 29, 2016 2:27 PM
To: Taylor, Ronald C
Cc: Jules Damji; user@spark.apache.org<mailto:user@spark.apache.org>; 
ronald.taylo...@gmail.com<mailto:ronald.taylo...@gmail.com>
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing

RDDOperationScope is in spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C 
<ronald.tay...@pnnl.gov<mailto:ronald.tay...@pnnl.gov>> wrote:
Hi Jules, folks,

I have tried “hdfs://” as well as “file://”.  And several variants. 
Every time, I get the same msg – NoClassDefFoundError. See below. Why do I get 
such a msg, if the problem is simply that Spark cannot find the text file? 
Doesn’t the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a 
Python programmer.  But doesn’t it look like a call to a Java class –something 
associated with “o9.textFile” -  is failing?  If so, how do I fix this?

  Ron


"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568<tel:%28509%29%20372-6568>,  email: 
ronald.tay...@pnnl.gov<mailto:ronald.tay...@pnnl.gov>
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_

RE: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Taylor, Ronald C
HI Yin,

My Classpath is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core.jar in the ../jars subdirectory, though it is 
not named precisely “spark-core.jar”. It has a version number in its name, as 
you can see:

[rtaylor@bigdatann ~]$ find 
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > 
/people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope 
jar_file_listing_of_spark-core_jar.txt

org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$


It looks like the RDDOperationScope class is present. Shouldn’t that work?

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Yin Yang [mailto:yy201...@gmail.com]
Sent: Monday, February 29, 2016 2:27 PM
To: Taylor, Ronald C
Cc: Jules Damji; user@spark.apache.org; ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing

RDDOperationScope is in spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C 
<ronald.tay...@pnnl.gov<mailto:ronald.tay...@pnnl.gov>> wrote:
Hi Jules, folks,

I have tried “hdfs://” as well as “file://”.  And several variants. 
Every time, I get the same msg – NoClassDefFoundError. See below. Why do I get 
such a msg, if the problem is simply that Spark cannot find the text file? 
Doesn’t the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a 
Python programmer.  But doesn’t it look like a call to a Java class –something 
associated with “o9.textFile” -  is failing?  If so, how do I fix this?

  Ron


"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
 line 451, in textFile
return RDD(self._jsc.textFile(name, minPartitions), self,
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568<tel:%28509%29%20372-6568>,  email: 
ronald.tay...@pnnl.gov<mailto:ronald.tay...@pnnl.gov>
web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Jules Damji [mailto:dmat...@comcast.net<mailto:dmat...@comcast.net>]
Sent: Sunday, February 28, 2016 10:07 PM
To: Taylor, Ronald C
Cc: user@spark.apache.org<mailto:user@spark.apache.org>; 
ronald.taylo...@gmail.com<mailto:ronald.taylo...@gmail.com>
Subject: Re: a basic question on first use of PySpark shell and example, which 
is failing


Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name 
to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C 
<ronald.tay...@pnnl.gov<mailto:ronald.tay...@pnnl.gov>> wrote:

Hello folks,

I  am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at 
our lab. I 

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Yin Yang
RDDOperationScope is in spark-core_2.1x jar file.

  7148 Mon Feb 29 09:21:32 PST 2016
org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in classpath ?

FYI
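One way to check this without guessing at environment variables is to ask the
driver JVM itself where (or whether) it loaded the class from - the classpath
the Spark launcher builds is not necessarily the same as the shell's CLASSPATH.
A minimal sketch, e.g. from a Scala spark-shell on the same node (the class
name is the one from the error; everything else is illustrative):

// If this throws ClassNotFoundException, the class is not on the driver
// classpath at all; if it prints a jar path, that is the jar actually used.
val cls = Class.forName("org.apache.spark.rdd.RDDOperationScope")
val source = cls.getProtectionDomain.getCodeSource
println(if (source == null) "loaded, but no code source recorded" else source.getLocation.toString)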

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov>
wrote:

> Hi Jules, folks,
>
>
>
> I have tried “hdfs://” as well as “file://”.  And several variants. Every time, I get the same msg –
> NoClassDefFoundError. See below. Why do I get such a msg, if the problem is
> simply that Spark cannot find the text file? Doesn’t the error msg indicate
> some other source of the problem?
>
>
>
> I may be missing something in the error report; I am a Java person, not a
> Python programmer.  But doesn’t it look like a call to a Java class
> –something associated with “o9.textFile” -  is failing?  If so, how do I
> fix this?
>
>
>
>   Ron
>
>
>
>
>
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
> line 451, in textFile
>
> return RDD(self._jsc.textFile(name, minPartitions), self,
>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
> line 36, in deco
>
> return f(*a, **kw)
>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
>
> : java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.spark.rdd.RDDOperationScope$
>
>
>
> Ronald C. Taylor, Ph.D.
>
> Computational Biology & Bioinformatics Group
>
> Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
>
> Richland, WA 99352
>
> phone: (509) 372-6568,  email: ronald.tay...@pnnl.gov
>
> web page:  http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048
>
>
>
> *From:* Jules Damji [mailto:dmat...@comcast.net]
> *Sent:* Sunday, February 28, 2016 10:07 PM
> *To:* Taylor, Ronald C
> *Cc:* user@spark.apache.org; ronald.taylo...@gmail.com
> *Subject:* Re: a basic question on first use of PySpark shell and
> example, which is failing
>
>
>
>
>
> Hello Ronald,
>
>
>
> Since you have placed the file under HDFS, you might want to change the path
> name to:
>
>
>
> val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")
>
>
> Sent from my iPhone
>
> Pardon the dumb thumb typos :)
>
>
> On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov>
> wrote:
>
>
>
> Hello folks,
>
>
>
> I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster
> at our lab. I am trying to use the PySpark shell for the first time, and am
> attempting to duplicate the documentation example of creating an RDD
> which I called "lines" using a text file.
>
> I placed a text file called Warehouse.java in this HDFS location:
>
>
> [rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
> -rw-r--r--   3 rtaylor supergroup1155355 2016-02-28 18:09
> /user/rtaylor/Spark/Warehouse.java
> [rtaylor@bigdatann ~]$
>
> I then invoked sc.textFile() in the PySpark shell. That did not work. See
> below. Apparently a class is not found? I don't know why that would be the
> case. Any guidance would be very much appreciated.
>
> The Cloudera Manager for the cluster says that Spark is operating  in the
> "green", for whatever that is worth.
>
>  - Ron Taylor
>
>
> >>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
> line 451, in textFile
> return RDD(self._jsc.textFile(name, minPartitions), self,
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
> line 36, in deco
> return f(*a, **kw)
>   File
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> : 

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-28 Thread Jules Damji

Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name 
to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C  wrote:
> 
> 
> Hello folks,
> 
> I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at 
> our lab. I am trying to use the PySpark shell for the first time, and am 
> attempting to duplicate the documentation example of creating an RDD which 
> I called "lines" using a text file.
> 
> I placed a text file called Warehouse.java in this HDFS location:
> 
> [rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
> -rw-r--r--   3 rtaylor supergroup1155355 2016-02-28 18:09 
> /user/rtaylor/Spark/Warehouse.java
> [rtaylor@bigdatann ~]$ 
> 
> I then invoked sc.textFile() in the PySpark shell. That did not work. See 
> below. Apparently a class is not found? I don't know why that would be the 
> case. Any guidance would be very much appreciated.
> 
> The Cloudera Manager for the cluster says that Spark is operating  in the 
> "green", for whatever that is worth.
> 
>  - Ron Taylor
> 
> >>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py",
>  line 451, in textFile
> return RDD(self._jsc.textFile(name, minPartitions), self,
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> : java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.spark.rdd.RDDOperationScope$
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
> at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
> at 
> org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> 
> >>> 


Re: initial basic question from new user

2014-06-12 Thread Gerard Maas
The goal of rdd.persist is to create a cached RDD that breaks the DAG
lineage. Therefore, computations *in the same job* that use that RDD can
re-use that intermediate result, but it's not meant to survive between job
runs.

for example:

val baseData = rawDataRdd.map(...).flatMap(...).reduceByKey(...).persist
val metric1 = baseData.flatMap(op1).reduceByKey.collect
val metric2 = baseData.flatMap(op2).reduceByKey.collect

Without persist, computing metric1 and metric2 would trigger the
computation starting from rawData. With persist, both metric1 and metric2
will start from the intermediate result (baseData)

If you need to ad-hoc persist to files, you can save RDDs using
rdd.saveAsObjectFile(...) [1] and load them afterwards using
sparkContext.objectFile(...)
If you want to preserve the RDDs in memory between job runs, you should
look at the Spark-JobServer [3]

[1]
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

[2]
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.SparkContext

[3] https://github.com/ooyala/spark-jobserver
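Concretely, building on the baseData example above (sc being the SparkContext),
the save/reload round trip could look like this (the path and the element type
are illustrative):

// Run 1: materialise the intermediate result to durable storage
baseData.saveAsObjectFile("hdfs:///user/example/baseData.obj")

// Run 2, possibly a different application: reload it and keep working without
// redoing the work that produced baseData. The type parameter is needed
// because objectFile is generic; (String, Int) is just an assumption here.
val baseDataReloaded = sc.objectFile[(String, Int)]("hdfs:///user/example/baseData.obj")
println(baseDataReloaded.count())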



On Thu, Jun 12, 2014 at 11:24 AM, Toby Douglass t...@avocet.io wrote:

 Gents,

 I am investigating Spark with a view to perform reporting on a large data
 set, where the large data set receives additional data in the form of log
 files on an hourly basis.

 Where the data set is large there is a possibility we will create a range
 of aggregate tables, to reduce the volume of data which has to be processed.

 Having spent a little while reading up about Spark, my thought was that I
 could create an RDD which is an agg, persist this to disk, have reporting
 queries run against that RDD and when new data arrives, convert the new log
 file into an agg and add that to the agg RDD.

 However, I begin now to get the impression that RDDs cannot be persisted
 across jobs - I can generate an RDD, I can persist it, but I can see no way
 for a later job to load a persisted RDD (and I begin to think it will have
 been GCed anyway, at the end of the first job).  Is this correct?





Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
for your use case. As for Parquet support, that's newly arrived in Spark
1.0.0 together with SparkSQL so continue to watch this space.

Gerard's suggestion to look at JobServer, which you can generalize as
building a long-running application which allows multiple clients to
load/share/persist/save/collaborate-on RDDs satisfies a larger, more
complex use case. That is indeed the job of a higher-level application,
subject to a wide variety of higher-level design choices. A number of us
have successfully built Spark-based apps around that model.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass t...@avocet.io wrote:

 On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas gerard.m...@gmail.com
 wrote:

 The goal of rdd.persist is to create a cached RDD that breaks the DAG
 lineage. Therefore, computations *in the same job* that use that RDD can
 re-use that intermediate result, but it's not meant to survive between job
 runs.


 As I understand it, Spark is designed for interactive querying, in the
 sense that the caching of intermediate results eliminates the need to
 recompute those results.

 However, if intermediate results last only for the duration of a job (e.g.
 say a Python script), how exactly is interactive querying actually
 performed? A script is not an interactive medium.  Is the shell the only
 medium for interactive querying?

 Consider a common usage case : a web-site, which offers reporting upon a
 large data set.  Users issue arbitrary queries.  A few queries (just with
 different arguments) dominate the query load, so we thought to create
 intermediate RDDs to service those queries, so only those order of
 magnitude or smaller RDDs would need to be processed.  Where this is not
 possible, we can only use Spark for reporting by issuing each query over
 the whole data set - e.g. Spark is just like Impala is just like Presto is
 just like [nnn].  The enormous benefit of RDDs - the entire point of Spark
 - so profoundly useful here - is not available.  What a huge and unexpected
 loss!  Spark seemingly renders itself ordinary.  It is for this reason I am
 surprised to find this functionality is not available.


 If you need to ad-hoc persist to files, you can save RDDs using
 rdd.saveAsObjectFile(...) [1] and load them afterwards using
 sparkContext.objectFile(...)


 I've been using this site for docs;

 http://spark.apache.org

 Here we find through the top-of-the-page menus the link API Docs -
 Python API which brings us to;

 http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html

 Where this page does not show the function saveAsObjectFile().

 I find now from your link here;


 https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

 What appears to be a second and more complete set of the same
 documentation, using a different web-interface to boot.

 It appears at least that there are two sets of documentation for the same
 APIs, where one set is out of date and the other not, and the out-of-date
 set is that which is linked to from the main site?

 Given that our agg sizes will exceed memory, we expect to cache them to
 disk, so save-as-object (assuming there are no out of the ordinary
 performance issues) may solve the problem, but I was hoping to store data
 in a column-orientated format.  However I think this in general is not
 possible - Spark can *read* Parquet, but I think it cannot write Parquet as
 a disk-based RDD format.

 If you want to preserve the RDDs in memory between job runs, you should
 look at the Spark-JobServer [3]


 Thank you.

 I view this with some trepidation.  It took two man-days to get Spark
 running (and I've spent another man day now trying to get a map/reduce to
 run; I'm getting there, but not there yet) - the bring-up/config experience
 for end-users is not tested or accurately documented (although to be clear,
 no better and no worse than is normal for open source; Spark is not
 exceptional).  Having to bring up another open source project is a
 significant barrier to entry; it's always such a headache.

 The save-to-disk function you mentioned earlier will allow intermediate
 RDDs to go to disk, but we do in fact have a use case where in-memory would
 be useful; it might allow us to ditch Cassandra, which would be wonderful,
 since it would reduce the system count by one.

 I have to say, having to install JobServer to achieve this one end seems
 an extraordinarily heavyweight solution - a whole new application, when all
 that is wished for is that Spark persists RDDs across jobs, where so small
 a feature seems to open the door to so much functionality.





Re: initial basic question from new user

2014-06-12 Thread FRANK AUSTIN NOTHAFT
RE:

 Given that our agg sizes will exceed memory, we expect to cache them to
disk, so save-as-object (assuming there are no out of the ordinary
performance issues) may solve the problem, but I was hoping to store data
in a column-orientated format.  However I think this in general is not
possible - Spark can *read* Parquet, but I think it cannot write Parquet as
a disk-based RDD format.

Spark can write Parquet via the ParquetOutputFormat that is distributed
with Parquet. If you'd like example code for writing out to Parquet, please
see the adamSave function in
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala,
starting at line 62. There is a bit of setup necessary for the Parquet
write codec, but otherwise it is fairly straightforward.

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466


On Thu, Jun 12, 2014 at 7:03 AM, Christopher Nguyen c...@adatao.com wrote:

 Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
 for your use case. As for Parquet support, that's newly arrived in Spark
 1.0.0 together with SparkSQL so continue to watch this space.

 Gerard's suggestion to look at JobServer, which you can generalize as
 building a long-running application which allows multiple clients to
 load/share/persist/save/collaborate-on RDDs satisfies a larger, more
 complex use case. That is indeed the job of a higher-level application,
 subject to a wide variety of higher-level design choices. A number of us
 have successfully built Spark-based apps around that model.
 --
 Christopher T. Nguyen
 Co-founder & CEO, Adatao http://adatao.com
 linkedin.com/in/ctnguyen



 On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass t...@avocet.io wrote:

 On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas gerard.m...@gmail.com
 wrote:

 The goal of rdd.persist is to create a cached RDD that breaks the DAG
 lineage. Therefore, computations *in the same job* that use that RDD can
 re-use that intermediate result, but it's not meant to survive between job
 runs.


 As I understand it, Spark is designed for interactive querying, in the
 sense that the caching of intermediate results eliminates the need to
 recompute those results.

 However, if intermediate results last only for the duration of a job
 (e.g. say a Python script), how exactly is interactive querying actually
 performed? A script is not an interactive medium.  Is the shell the only
 medium for interactive querying?

 Consider a common usage case : a web-site, which offers reporting upon a
 large data set.  Users issue arbitrary queries.  A few queries (just with
 different arguments) dominate the query load, so we thought to create
 intermediate RDDs to service those queries, so only those order of
 magnitude or smaller RDDs would need to be processed.  Where this is not
 possible, we can only use Spark for reporting by issuing each query over
 the whole data set - e.g. Spark is just like Impala is just like Presto is
 just like [nnn].  The enormous benefit of RDDs - the entire point of Spark
 - so profoundly useful here - is not available.  What a huge and unexpected
 loss!  Spark seemingly renders itself ordinary.  It is for this reason I am
 surprised to find this functionality is not available.


 If you need to ad-hoc persist to files, you can save RDDs using
 rdd.saveAsObjectFile(...) [1] and load them afterwards using
 sparkContext.objectFile(...)


 I've been using this site for docs;

 http://spark.apache.org

 Here we find through the top-of-the-page menus the link API Docs -
 Python API which brings us to;

 http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html

 Where this page does not show the function saveAsObjectFile().

 I find now from your link here;


 https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

 What appears to be a second and more complete set of the same
 documentation, using a different web-interface to boot.

 It appears at least that there are two sets of documentation for the same
 APIs, where one set is out of date and the other not, and the out-of-date
 set is that which is linked to from the main site?

 Given that our agg sizes will exceed memory, we expect to cache them to
 disk, so save-as-object (assuming there are no out of the ordinary
 performance issues) may solve the problem, but I was hoping to store data
 in a column-orientated format.  However I think this in general is not
 possible - Spark can *read* Parquet, but I think it cannot write Parquet as
 a disk-based RDD format.

 If you want to preserve the RDDs in memory between job runs, you should
 look at the Spark-JobServer [3]


 Thank you.

 I view this with some trepidation.  It took two man-days to get Spark
 running (and I've spent another man day now trying to get a map/reduce to
 run; I'm getting there, but not there yet) - the bring-up/config experience
 for end-users is not 

Re: initial basic question from new user

2014-06-12 Thread Andre Schumacher

Hi,

On 06/12/2014 05:47 PM, Toby Douglass wrote:

 In these future jobs, when I come to load the aggregted RDD, will Spark
 load and only load the columns being accessed by the query?  or will Spark
 load everything, to convert it into an internal representation, and then
 execute the query?

The aforementioned native Parquet support in Spark 1.0 supports column
projections, which means only the columns that appear in the query will
be loaded. The next release will also support record filters for simple
pruning predicates (an int column smaller than some value, and such). This is
different from using a Hadoop Input/Output format and requires no
additional setup (jars in the classpath and such).

For more details see:

http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
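To make that concrete, a small end-to-end sketch using the Spark 1.0-era API
could look like the following (the record type, paths and table name are
illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative record type for one aggregate row
case class Agg(key: String, total: Long)

object ParquetAggExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetAggExample"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD[case class] -> SchemaRDD in 1.0.x

    // Write the aggregate out as Parquet; the columnar files survive across jobs
    val aggs = sc.parallelize(Seq(Agg("a", 1L), Agg("b", 2L), Agg("a", 3L)))
    aggs.saveAsParquetFile("hdfs:///user/example/aggs.parquet")

    // A later job can reload it; only the columns the query touches are read
    val reloaded = sqlContext.parquetFile("hdfs:///user/example/aggs.parquet")
    reloaded.registerAsTable("aggs")
    sqlContext.sql("SELECT key, total FROM aggs WHERE total > 1").collect().foreach(println)

    sc.stop()
  }
}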

Andre


Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher 
schum...@icsi.berkeley.edu wrote:

 On 06/12/2014 05:47 PM, Toby Douglass wrote:

  In these future jobs, when I come to load the aggregted RDD, will Spark
  load and only load the columns being accessed by the query?  or will
 Spark
  load everything, to convert it into an internal representation, and then
  execute the query?

 The aforementioned native Parquet support in Spark 1.0 supports column
 projections which means only the columns that appear in the query will
 be loaded.


[snip]

 Thank you!