Re: A basic question
Thank you so much, Deepak. Let me implement it and update you. Hope it works. Are there any shortcomings I need to consider or take care of?

Regards,
Shyam

On Mon, Jun 17, 2019 at 12:39 PM Deepak Sharma wrote:

> You can follow this example:
>
> https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
>
> On Mon, Jun 17, 2019 at 12:27 PM Shyam P wrote:
>
>> I am developing a Spark job using Java 1.8.
>>
>> Is it possible to write a Spark app using Spring Boot?
>> Has anyone tried it? If so, how should it be done?
>>
>> Regards,
>> Shyam
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
Re: A basic question
You can follow this example:

https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html

On Mon, Jun 17, 2019 at 12:27 PM Shyam P wrote:

> I am developing a Spark job using Java 1.8.
>
> Is it possible to write a Spark app using Spring Boot?
> Has anyone tried it? If so, how should it be done?
>
> Regards,
> Shyam

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
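[Editor's note] For readers who want to see the shape of such an app before digging into the spring-hadoop docs, here is a minimal, hypothetical sketch. It is written in Scala for consistency with the rest of this digest even though the original question targets Java 8; the class name, app name, and the local master setting are illustrative assumptions, not taken from the linked example. The idea is that Spring Boot owns the application lifecycle and the Spark work runs inside a CommandLineRunner:

    import org.apache.spark.sql.SparkSession
    import org.springframework.boot.{CommandLineRunner, SpringApplication}
    import org.springframework.boot.autoconfigure.SpringBootApplication

    // Hypothetical Spring Boot entry point that runs a Spark job on startup.
    @SpringBootApplication
    class SparkJobApplication extends CommandLineRunner {
      override def run(args: String*): Unit = {
        val spark = SparkSession.builder()
          .appName("spring-boot-spark-demo")
          .master("local[*]") // assumption: local mode; on a cluster, let spark-submit set the master
          .getOrCreate()
        try {
          // Placeholder job: sum the numbers 0..99.
          spark.range(0, 100).selectExpr("sum(id)").show()
        } finally {
          spark.stop()
        }
      }
    }

    object SparkJobApplication {
      def main(args: Array[String]): Unit =
        SpringApplication.run(classOf[SparkJobApplication], args: _*)
    }

As for shortcomings: one issue people commonly report is packaging, since Spring Boot's repackaged "nested jar" layout does not always play well with spark-submit, so a shaded or assembly-style jar (or running in client/local mode inside the Boot app) tends to be the less painful route.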
Re: Fw: Basic question on using one's own classes in the Scala app
Hi Ashok,

This is not really a Spark-related question, so I would not use this mailing list. Anyway, my 2 cents here: as outlined by earlier replies, if the class you are referencing is in a different jar, at compile time you will need to add that dependency to your build.sbt. I'd personally leave $CLASSPATH alone.

AT RUN TIME, you have two options:

1 - As suggested by Ted, when you launch your app via spark-submit you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar.
2 - Use the sbt-assembly plugin to package your classes and jars into a 'fat jar', and then at runtime all you need to do is spark-submit --class <your main class> <your fat jar>.

I'd personally go for 1 as it is the easiest option. (FYI, for 2 you might encounter situations where you have dependencies referring to the same classes, and that will require you to define an assemblyMergeStrategy.)

hth

On Mon, Jun 6, 2016 at 8:52 AM, Ashok Kumar wrote:

> Anyone can help me with this please
>
> On Sunday, 5 June 2016, 11:06, Ashok Kumar wrote:
>
> Hi all,
>
> Appreciate any advice on this. It is about Scala.
>
> I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps.
>
> class getCheckpointDirectory {
>   def CheckpointDirectory (ProgramName: String) : String = {
>     var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
>     return hdfsDir
>   }
> }
>
> I have used sbt to create a jar file for it. It is created as a jar file
>
> utilities-assembly-0.1-SNAPSHOT.jar
>
> Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir
>
>   val ProgramName = this.getClass.getSimpleName.trim
>   val getCheckpointDirectory = new getCheckpointDirectory
>   val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)
>
> However, I am getting a compilation error as expected
>
> not found: type getCheckpointDirectory
> [error]     val getCheckpointDirectory = new getCheckpointDirectory
> [error]                                      ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
>
> So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following I do in sbt
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>
> Or add the jar file to $CLASSPATH?
>
> Any advice will be appreciated.
>
> Thanks
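[Editor's note] For option 1 above, the launch would look roughly like "spark-submit --class MyApp --jars utilities-assembly-0.1-SNAPSHOT.jar myapp.jar", where the class and jar names are placeholders. For option 2, here is a minimal sketch of the sbt-assembly wiring; the plugin version and the merge rules are illustrative assumptions rather than something taken from this thread:

    // project/plugins.sbt -- plugin version is an assumption, use whatever is current
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // build.sbt -- only needed when duplicate files from different jars collide
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop conflicting manifests/signature files
      case _                             => MergeStrategy.first   // otherwise keep the first copy seen
    }

Once the fat jar is built with "sbt assembly", the runtime classpath question largely disappears, because the utility classes travel inside the application jar itself.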
RE: a basic question on first use of PySpark shell and example, which is failing
I guess I should also point out that I do an export CLASSPATH in my .bash_profile file, so the CLASSPATH info should be usable by the PySpark shell that I invoke.

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568, email: ronald.tay...@pnnl.gov
web page: http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Taylor, Ronald C
Sent: Monday, February 29, 2016 2:57 PM
To: 'Yin Yang'; user@spark.apache.org
Cc: Jules Damji; ronald.taylo...@gmail.com; Taylor, Ronald C
Subject: RE: a basic question on first use of PySpark shell and example, which is failing

Hi Yin,

My CLASSPATH is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core jar in the ../jars subdirectory, though it is not named precisely "spark-core.jar". It has a version number in its name, as you can see:

[rtaylor@bigdatann ~]$ find /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > /people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope jar_file_listing_of_spark-core_jar.txt
org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$

It looks like the RDDOperationScope class is present. Shouldn't that work?

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568, email: ronald.tay...@pnnl.gov
web page: http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Yin Yang [mailto:yy201...@gmail.com]
Sent: Monday, February 29, 2016 2:27 PM
To: Taylor, Ronald C
Cc: Jules Damji; user@spark.apache.org; ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which is failing

RDDOperationScope is in the spark-core_2.1x jar file.

7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in the classpath?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:

Hi Jules, folks,

I have tried "hdfs://" as well as "file://". And several variants. Every time, I get the same msg - NoClassDefFoundError. See below. Why do I get such a msg if the problem is simply that Spark cannot find the text file? Doesn't the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a Python programmer. But doesn't it look like a call to a Java class - something associated with "o9.textFile" - is failing? If so, how do I fix this?

Ron

"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
    return RDD(self._jsc.textFile(name, minPartitions), self,
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568, email: ronald.tay...@pnnl.gov
web page: http://www.pnnl.gov/science/staff/staff_
RE: a basic question on first use of PySpark shell and example, which is failing
Hi Yin,

My CLASSPATH is set to:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils:.

And there is indeed a spark-core jar in the ../jars subdirectory, though it is not named precisely "spark-core.jar". It has a version number in its name, as you can see:

[rtaylor@bigdatann ~]$ find /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars -name "spark-core*.jar"
/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-core_2.10-1.5.0-cdh5.5.1.jar

I extracted the class names into a text file:

[rtaylor@bigdatann jars]$ jar tf spark-core_2.10-1.5.0-cdh5.5.1.jar > /people/rtaylor/SparkWork/jar_file_listing_of_spark-core_jar.txt

And then searched for RDDOperationScope. I found these classes:

[rtaylor@bigdatann SparkWork]$ grep RDDOperationScope jar_file_listing_of_spark-core_jar.txt
org/apache/spark/rdd/RDDOperationScope$$anonfun$5.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$3.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4$$anonfun$apply$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$4.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$1.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$2.class
org/apache/spark/rdd/RDDOperationScope$.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$getAllScopes$1.class
org/apache/spark/rdd/RDDOperationScope.class
org/apache/spark/rdd/RDDOperationScope$$anonfun$2.class
[rtaylor@bigdatann SparkWork]$

It looks like the RDDOperationScope class is present. Shouldn't that work?

Ron

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568, email: ronald.tay...@pnnl.gov
web page: http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Yin Yang [mailto:yy201...@gmail.com]
Sent: Monday, February 29, 2016 2:27 PM
To: Taylor, Ronald C
Cc: Jules Damji; user@spark.apache.org; ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which is failing

RDDOperationScope is in the spark-core_2.1x jar file.

7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in the classpath?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:

Hi Jules, folks,

I have tried "hdfs://" as well as "file://". And several variants. Every time, I get the same msg - NoClassDefFoundError. See below. Why do I get such a msg if the problem is simply that Spark cannot find the text file? Doesn't the error msg indicate some other source of the problem?

I may be missing something in the error report; I am a Java person, not a Python programmer. But doesn't it look like a call to a Java class - something associated with "o9.textFile" - is failing? If so, how do I fix this?

Ron

"/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
    return RDD(self._jsc.textFile(name, minPartitions), self,
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$

Ronald C. Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568, email: ronald.tay...@pnnl.gov
web page: http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048

From: Jules Damji [mailto:dmat...@comcast.net]
Sent: Sunday, February 28, 2016 10:07 PM
To: Taylor, Ronald C
Cc: user@spark.apache.org; ronald.taylo...@gmail.com
Subject: Re: a basic question on first use of PySpark shell and example, which is failing

Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)

On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:

Hello folks,

I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at our lab. I
Re: a basic question on first use of PySpark shell and example, which is failing
RDDOperationScope is in the spark-core_2.1x jar file.

7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class

Can you check whether the spark-core jar is in the classpath?

FYI

On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:

> Hi Jules, folks,
>
> I have tried "hdfs://<filepath>" as well as "file://<filepath>". And several variants. Every time, I get the same msg - NoClassDefFoundError. See below. Why do I get such a msg, if the problem is simply that Spark cannot find the text file? Doesn't the error msg indicate some other source of the problem?
>
> I may be missing something in the error report; I am a Java person, not a Python programmer. But doesn't it look like a call to a Java class - something associated with "o9.textFile" - is failing? If so, how do I fix this?
>
> Ron
>
> "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
>     return RDD(self._jsc.textFile(name, minPartitions), self,
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
>     return f(*a, **kw)
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
>
> Ronald C. Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
> Richland, WA 99352
> phone: (509) 372-6568, email: ronald.tay...@pnnl.gov
> web page: http://www.pnnl.gov/science/staff/staff_info.asp?staff_num=7048
>
> From: Jules Damji [mailto:dmat...@comcast.net]
> Sent: Sunday, February 28, 2016 10:07 PM
> To: Taylor, Ronald C
> Cc: user@spark.apache.org; ronald.taylo...@gmail.com
> Subject: Re: a basic question on first use of PySpark shell and example, which is failing
>
> Hello Ronald,
>
> Since you have placed the file under HDFS, you might want to change the path name to:
>
> val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C <ronald.tay...@pnnl.gov> wrote:
>
> Hello folks,
>
> I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at our lab. I am trying to use the PySpark shell for the first time, and am attempting to duplicate the documentation example of creating an RDD, which I called "lines", using a text file.
>
> I placed a text file called Warehouse.java in this HDFS location:
>
> [rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
> -rw-r--r-- 3 rtaylor supergroup 1155355 2016-02-28 18:09 /user/rtaylor/Spark/Warehouse.java
> [rtaylor@bigdatann ~]$
>
> I then invoked sc.textFile() in the PySpark shell. That did not work. See below. Apparently a class is not found? Don't know why that would be the case. Any guidance would be very much appreciated.
>
> The Cloudera Manager for the cluster says that Spark is operating in the "green", for whatever that is worth.
>
> - Ron Taylor
>
> >>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
>     return RDD(self._jsc.textFile(name, minPartitions), self,
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
>     return f(*a, **kw)
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> :
Re: a basic question on first use of PySpark shell and example, which is failing
Hello Ronald,

Since you have placed the file under HDFS, you might want to change the path name to:

val lines = sc.textFile("hdfs://user/taylor/Spark/Warehouse.java")

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C wrote:
>
> Hello folks,
>
> I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at our lab. I am trying to use the PySpark shell for the first time, and am attempting to duplicate the documentation example of creating an RDD, which I called "lines", using a text file.
>
> I placed a text file called Warehouse.java in this HDFS location:
>
> [rtaylor@bigdatann ~]$ hadoop fs -ls /user/rtaylor/Spark
> -rw-r--r-- 3 rtaylor supergroup 1155355 2016-02-28 18:09 /user/rtaylor/Spark/Warehouse.java
> [rtaylor@bigdatann ~]$
>
> I then invoked sc.textFile() in the PySpark shell. That did not work. See below. Apparently a class is not found? Don't know why that would be the case. Any guidance would be very much appreciated.
>
> The Cloudera Manager for the cluster says that Spark is operating in the "green", for whatever that is worth.
>
> - Ron Taylor
>
> >>> lines = sc.textFile("file:///user/taylor/Spark/Warehouse.java")
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/context.py", line 451, in textFile
>     return RDD(self._jsc.textFile(name, minPartitions), self,
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
>     return f(*a, **kw)
>   File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o9.textFile.
> : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
>     at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
>     at org.apache.spark.SparkContext.textFile(SparkContext.scala:825)
>     at org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:191)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
>
> >>>
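[Editor's note] The thread concludes that the NoClassDefFoundError points at a classpath problem rather than a bad path, but since several path forms are being tried above, here is a small reference sketch written for the Scala shell, where sc is predefined. The namenode host/port and the local path are illustrative assumptions, not taken from this cluster:

    // Bare absolute path: resolved against the cluster's fs.defaultFS, i.e. HDFS on a CDH cluster.
    val fromHdfs = sc.textFile("/user/rtaylor/Spark/Warehouse.java")

    // Fully qualified HDFS URI; adjust the namenode host and port to the cluster.
    val fromHdfsUri = sc.textFile("hdfs://namenode.example.com:8020/user/rtaylor/Spark/Warehouse.java")

    // file:/// paths only work if the file exists at that location on the driver and on every worker node.
    val fromLocal = sc.textFile("file:///home/rtaylor/Warehouse.java")

    println(fromHdfs.count())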
Re: initial basic question from new user
The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs. For example:

val baseData = rawDataRdd.map(...).flatMap(...).reduceByKey(...).persist
val metric1 = baseData.flatMap(op1).reduceByKey(...).collect
val metric2 = baseData.flatMap(op2).reduceByKey(...).collect

Without persist, computing metric1 and metric2 would trigger the computation starting from rawData. With persist, both metric1 and metric2 will start from the intermediate result (baseData).

If you need to ad-hoc persist to files, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them afterwards using sparkContext.objectFile(...) [2].

If you want to preserve the RDDs in memory between job runs, you should look at the Spark-JobServer [3].

[1] https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
[2] https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.SparkContext
[3] https://github.com/ooyala/spark-jobserver

On Thu, Jun 12, 2014 at 11:24 AM, Toby Douglass t...@avocet.io wrote:

> Gents,
>
> I am investigating Spark with a view to perform reporting on a large data set, where the large data set receives additional data in the form of log files on an hourly basis.
>
> Where the data set is large, there is a possibility we will create a range of aggregate tables, to reduce the volume of data which has to be processed.
>
> Having spent a little while reading up about Spark, my thought was that I could create an RDD which is an agg, persist this to disk, have reporting queries run against that RDD, and when new data arrives, convert the new log file into an agg and add that to the agg RDD.
>
> However, I begin now to get the impression that RDDs cannot be persisted across jobs - I can generate an RDD, I can persist it, but I can see no way for a later job to load a persisted RDD (and I begin to think it will have been GCed anyway, at the end of the first job). Is this correct?
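[Editor's note] To make the save-and-reload cycle above concrete, here is a minimal sketch; the paths, the tab-separated log format, and the key prefix are illustrative assumptions, not from this thread. The first half is one job, the second half is what a later job would run:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // needed for reduceByKey in Spark 1.0

    object AggRoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("agg-round-trip"))

        // Job 1: build the aggregate once and write it to durable storage.
        val agg = sc.textFile("hdfs:///logs/2014-06-12")
          .map(line => (line.split("\t")(0), 1L))
          .reduceByKey(_ + _)
        agg.saveAsObjectFile("hdfs:///aggs/hourly-counts")

        // Job 2 (possibly days later): reload the intermediate result
        // instead of recomputing it from the raw logs.
        val reloaded = sc.objectFile[(String, Long)]("hdfs:///aggs/hourly-counts")
        println(reloaded.filter { case (key, _) => key.startsWith("api/") }.count())

        sc.stop()
      }
    }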
Re: initial basic question from new user
Toby,

#saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL, so continue to watch this space.

Gerard's suggestion to look at JobServer, which you can generalize as building a long-running application which allows multiple clients to load/share/persist/save/collaborate-on RDDs, satisfies a larger, more complex use case. That is indeed the job of a higher-level application, subject to a wide variety of higher-level design choices. A number of us have successfully built Spark-based apps around that model.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass t...@avocet.io wrote:

> On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas gerard.m...@gmail.com wrote:
>
>> The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs.
>
> As I understand it, Spark is designed for interactive querying, in the sense that the caching of intermediate results eliminates the need to recompute those results. However, if intermediate results last only for the duration of a job (e.g. say a Python script), how exactly is interactive querying actually performed? A script is not an interactive medium. Is the shell the only medium for interactive querying?
>
> Consider a common usage case: a web site which offers reporting upon a large data set. Users issue arbitrary queries. A few queries (just with different arguments) dominate the query load, so we thought to create intermediate RDDs to service those queries, so only those order-of-magnitude or smaller RDDs would need to be processed. Where this is not possible, we can only use Spark for reporting by issuing each query over the whole data set - e.g. Spark is just like Impala is just like Presto is just like [nnn]. The enormous benefit of RDDs - the entire point of Spark - so profoundly useful here - is not available. What a huge and unexpected loss! Spark seemingly renders itself ordinary. It is for this reason I am surprised to find this functionality is not available.
>
>> If you need to ad-hoc persist to files, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them afterwards using sparkContext.objectFile(...)
>
> I've been using this site for docs: http://spark.apache.org
>
> Here we find, through the top-of-the-page menus, the link API Docs - Python API, which brings us to:
>
> http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html
>
> This page does not show the function saveAsObjectFile(). I find now from your link here:
>
> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
>
> what appears to be a second and more complete set of the same documentation, using a different web interface to boot. It appears at least that there are two sets of documentation for the same APIs, where one set is out of date and the other is not, and the out-of-date set is the one linked to from the main site?
>
> Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However I think this in general is not possible - Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based RDD format.
>
>> If you want to preserve the RDDs in memory between job runs, you should look at the Spark-JobServer [3]
>
> Thank you. I view this with some trepidation. It took two man-days to get Spark running (and I've spent another man-day now trying to get a map/reduce to run; I'm getting there, but not there yet) - the bring-up/config experience for end users is not tested or accurately documented (although, to be clear, no better and no worse than is normal for open source; Spark is not exceptional). Having to bring up another open source project is a significant barrier to entry; it's always such a headache.
>
> The save-to-disk function you mentioned earlier will allow intermediate RDDs to go to disk, but we do in fact have a use case where in-memory would be useful; it might allow us to ditch Cassandra, which would be wonderful, since it would reduce the system count by one.
>
> I have to say, having to install JobServer to achieve this one end seems an extraordinarily heavyweight solution - a whole new application, when all that is wished for is that Spark persists RDDs across jobs, where so small a feature seems to open the door to so much functionality.
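[Editor's note] On the point that Spark can read but not write Parquet: with the SQL module in Spark 1.0 it can do both. A minimal write-side sketch, assuming the shell's sc and an illustrative case class, file layout, and paths (none of which come from this thread):

    import org.apache.spark.sql.SQLContext

    // Illustrative row type for the aggregate; the field names are assumptions.
    case class HourlyAgg(key: String, hour: String, cnt: Long)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // Spark 1.0 implicit: RDD of case classes -> SchemaRDD

    val agg = sc.textFile("hdfs:///logs/2014-06-12")
      .map(_.split("\t"))
      .map(fields => HourlyAgg(fields(0), fields(1), fields(2).toLong))

    // Writes a columnar Parquet file that outlives the job and can be reloaded by later jobs.
    agg.saveAsParquetFile("hdfs:///aggs/hourly.parquet")

A later job (or a long-running server such as the JobServer mentioned above) can then read the file back with sqlContext.parquetFile and query it directly.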
Re: initial basic question from new user
RE:

> Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However I think this in general is not possible - Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based RDD format.

Spark can write Parquet, via the ParquetOutputFormat which is distributed from Parquet. If you'd like example code for writing out to Parquet, please see the adamSave function in https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala, starting at line 62. There is a bit of setup necessary for the Parquet write codec, but otherwise it is fairly straightforward.

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Thu, Jun 12, 2014 at 7:03 AM, Christopher Nguyen c...@adatao.com wrote:

> Toby,
>
> #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL, so continue to watch this space.
>
> Gerard's suggestion to look at JobServer, which you can generalize as building a long-running application which allows multiple clients to load/share/persist/save/collaborate-on RDDs, satisfies a larger, more complex use case. That is indeed the job of a higher-level application, subject to a wide variety of higher-level design choices. A number of us have successfully built Spark-based apps around that model.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao
> http://adatao.com
> linkedin.com/in/ctnguyen
>
> On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass t...@avocet.io wrote:
>
>> On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas gerard.m...@gmail.com wrote:
>>
>>> The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs.
>>
>> As I understand it, Spark is designed for interactive querying, in the sense that the caching of intermediate results eliminates the need to recompute those results. However, if intermediate results last only for the duration of a job (e.g. say a Python script), how exactly is interactive querying actually performed? A script is not an interactive medium. Is the shell the only medium for interactive querying?
>>
>> Consider a common usage case: a web site which offers reporting upon a large data set. Users issue arbitrary queries. A few queries (just with different arguments) dominate the query load, so we thought to create intermediate RDDs to service those queries, so only those order-of-magnitude or smaller RDDs would need to be processed. Where this is not possible, we can only use Spark for reporting by issuing each query over the whole data set - e.g. Spark is just like Impala is just like Presto is just like [nnn]. The enormous benefit of RDDs - the entire point of Spark - so profoundly useful here - is not available. What a huge and unexpected loss! Spark seemingly renders itself ordinary. It is for this reason I am surprised to find this functionality is not available.
>>
>>> If you need to ad-hoc persist to files, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them afterwards using sparkContext.objectFile(...)
>>
>> I've been using this site for docs: http://spark.apache.org
>>
>> Here we find, through the top-of-the-page menus, the link API Docs - Python API, which brings us to:
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html
>>
>> This page does not show the function saveAsObjectFile(). I find now from your link here:
>>
>> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
>>
>> what appears to be a second and more complete set of the same documentation, using a different web interface to boot. It appears at least that there are two sets of documentation for the same APIs, where one set is out of date and the other is not, and the out-of-date set is the one linked to from the main site?
>>
>> Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However I think this in general is not possible - Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based RDD format.
>>
>>> If you want to preserve the RDDs in memory between job runs, you should look at the Spark-JobServer [3]
>>
>> Thank you. I view this with some trepidation. It took two man-days to get Spark running (and I've spent another man-day now trying to get a map/reduce to run; I'm getting there, but not there yet) - the bring-up/config experience for end users is not
Re: initial basic question from new user
Hi,

On 06/12/2014 05:47 PM, Toby Douglass wrote:

> In these future jobs, when I come to load the aggregated RDD, will Spark load, and only load, the columns being accessed by the query? Or will Spark load everything, convert it into an internal representation, and then execute the query?

The aforementioned native Parquet support in Spark 1.0 supports column projections, which means only the columns that appear in the query will be loaded. The next release will also support record filters for simple pruning predicates (int-column < value and such). This is different from using a Hadoop Input/Output format and requires no additional setup (jars in the classpath and such).

For more details see:

http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet

Andre
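[Editor's note] To illustrate the projection behaviour Andre describes, here is a small read-side sketch, again assuming the shell's sc plus an illustrative Parquet file written earlier with columns key, hour, and cnt (the path and column names are assumptions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load the previously written Parquet file and register it for SQL queries.
    val hourly = sqlContext.parquetFile("hdfs:///aggs/hourly.parquet")
    hourly.registerAsTable("hourly")

    // Only the 'key' and 'cnt' columns are read from disk; 'hour' is never materialized.
    val busy = sqlContext.sql("SELECT key, cnt FROM hourly WHERE cnt > 100")
    busy.collect().foreach(println)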
Re: initial basic question from new user
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher schum...@icsi.berkeley.edu wrote:

> On 06/12/2014 05:47 PM, Toby Douglass wrote:
>
>> In these future jobs, when I come to load the aggregated RDD, will Spark load, and only load, the columns being accessed by the query? Or will Spark load everything, convert it into an internal representation, and then execute the query?
>
> The aforementioned native Parquet support in Spark 1.0 supports column projections, which means only the columns that appear in the query will be loaded.

[snip]

Thank you!