Re: Calling spark from a java web application.

2016-03-31 Thread Ricardo Paiva
Spark reads $SPARK_HOME/conf/log4j.properties.

By default the distribution ships only $SPARK_HOME/conf/log4j.properties.template, which you can copy to log4j.properties and edit.
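
If the problem is that the web application's own log4j configuration gets
overridden once the SparkContext is initialized, one workaround is to re-apply
it programmatically right after the context is created. A minimal sketch,
assuming log4j 1.x (the version Spark 1.x uses) and a placeholder path to your
application's properties file:

import org.apache.log4j.PropertyConfigurator
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("webapp-spark"))

// Re-apply the web application's own log4j configuration after Spark has
// initialized its logging. The path below is only an example placeholder.
PropertyConfigurator.configure("/path/to/webapp/log4j.properties")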

On Thu, Mar 31, 2016 at 3:28 PM, arul_anand_2000 [via Apache Spark User
List] <ml-node+s1001560n2664...@n3.nabble.com> wrote:

> Can you please let me know how the log4j properties were configured? I am
> trying to integrate Spark with a web application. When the Spark context gets
> initialized, it overrides the existing log4j properties.
>
> Regards,
> arul.
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: Spark Aggregations/Joins

2016-03-07 Thread Ricardo Paiva
Have you tried using join? Both RDD and DataFrame have this method, and it
performs a join just like a traditional relational database does.
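
For example, instead of nesting operations on two RDDs, you can key both
datasets by the user ID and join them. A minimal sketch with made-up data and
names (the recommendations would come from wherever you compute them):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("join-example").setMaster("local[*]"))

// Hypothetical data: user IDs loaded from the database, and per-user recommendations.
val userIds: RDD[Int] = sc.parallelize(Seq(1, 2, 3))
val recommendations: RDD[(Int, Seq[String])] = sc.parallelize(Seq(
  (1, Seq("productA", "productB")),
  (2, Seq("productC"))))

// Key the user IDs so both RDDs share the same key, then join them like a
// relational database would (inner join on the user ID).
val joined: RDD[(Int, Seq[String])] = userIds
  .map(id => (id, ()))
  .join(recommendations)
  .mapValues { case (_, recs) => recs }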

On Sat, Mar 5, 2016 at 3:17 AM, Agro [via Apache Spark User List] <
ml-node+s1001560n26403...@n3.nabble.com> wrote:

> So, initially, I have an RDD[Int] that I've loaded from my database, where
> each Int is a user ID. For each of these user IDs, I need to gather a bunch
> of other data (a list of recommended product IDs), which makes use of an
> RDD as well. I've tried doing this, but Spark doesn't allow nesting RDD
> operations on two different RDDs. I feel like this is a common
> problem, so are there any general solutions you guys know about?
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Ricardo Paiva
I use plain old JUnit.

Spark batch example:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.junit.AfterClass
import org.junit.Assert.assertEquals
import org.junit.BeforeClass
import org.junit.Test

object TestMyCode {

  // Shared SparkContext, created once before the tests and stopped afterwards.
  var sc: SparkContext = null

  @BeforeClass
  def setup(): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("Test Spark")
      .setMaster("local[*]")
    sc = new SparkContext(sparkConf)
  }

  @AfterClass
  def cleanup(): Unit = {
    sc.stop()
  }
}

class TestMyCode {

  @Test
  def testSaveNumbersToExtractor(): Unit = {
    val sql = new SQLContext(TestMyCode.sc)
    import sql.implicits._

    val numList = List(1, 2, 3, 4, 5)
    val df = TestMyCode.sc.parallelize(numList).toDF
    val numDf = df.select(df("_1").alias("num"))
    assertEquals(5, numDf.count)
  }

}

On Wed, Mar 2, 2016 at 2:54 PM, SRK [via Apache Spark User List] <
ml-node+s1001560n26380...@n3.nabble.com> wrote:

> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs? I
> am using core Spark, Spark SQL with DataFrames, and the Streaming API.
> Any good framework to cover unit tests for these APIs?
>
> Thanks!
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: SPARK REST API on YARN

2016-02-18 Thread Ricardo Paiva
You can use the YARN proxy:

http://<resource-manager-host>:8088/proxy/<application-id>/api/v1/applications/<app-name>/executors

I have a Scala application that monitors the number of executors of some
Spark Streaming jobs, and I had a similar problem. I iterate over the
running applications and get the number of executors:

import java.net.URL
import java.util.EnumSet

import scala.io.Source.fromInputStream
import scala.util.parsing.json.JSON

import org.apache.hadoop.yarn.api.records.YarnApplicationState

// yarnClient is an already started org.apache.hadoop.yarn.client.api.YarnClient
val states = EnumSet.of(
  YarnApplicationState.RUNNING,
  YarnApplicationState.ACCEPTED)
val it = yarnClient.getApplications(states).iterator()
while (it.hasNext()) {
  val app = it.next()
  // The tracking URL is the YARN proxy URL of the application
  val strUrl = app.getTrackingUrl + "api/v1/applications/" + app.getName + "/executors"
  val url = new URL(strUrl)
  val urlCon = url.openConnection()
  val content = fromInputStream(urlCon.getInputStream).getLines.mkString("\n")
  val j = JSON.parseFull(content)
  val currentExecutors = j.get.asInstanceOf[List[Map[String, String]]]
    .filterNot(_("id") == "driver").size
}


Regards,

Ricardo

On Thu, Feb 18, 2016 at 1:56 PM, alvarobrandon [via Apache Spark User List]
<ml-node+s1001560n26267...@n3.nabble.com> wrote:

> Hello:
>
> I wanted to access the REST API (
> http://spark.apache.org/docs/latest/monitoring.html#rest-api) of Spark to
> monitor my jobs. However I'm running my Spark Apps over YARN. When I try to
> make a request to http://localhost:4040/api/v1 as the documentation says
> I don't get any response. My question is: is it possible to access this
> REST API when you are not using Spark in standalone mode?
>
> Thanks in advance
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: using spark context in map function Task not serializable error

2016-01-19 Thread Ricardo Paiva
Did you try SparkContext.getOrCreate()?

You don't need to pass the SparkContext to the map function; you can
retrieve it from the SparkContext singleton.
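
A minimal sketch (the app name is just a placeholder). Keep in mind that a
SparkContext is only usable in driver-side code, not inside closures (map,
filter, etc.) that are serialized and shipped to the executors:

import org.apache.spark.{SparkConf, SparkContext}

// Returns the already-running context if one exists, otherwise creates one,
// so helper methods can look it up instead of receiving it as a parameter.
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("my-app"))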

Regards,

Ricardo


On Mon, Jan 18, 2016 at 6:29 PM, gpatcham [via Apache Spark User List] <
ml-node+s1001560n25998...@n3.nabble.com> wrote:

> Hi,
>
> I have a use case where I need to pass the SparkContext to a map function:
>
> reRDD.map(row => method1(row, sc)).saveAsTextFile(outputDir)
>
> Method1 needs the Spark context to query Cassandra. But I see the error below:
>
> java.io.NotSerializableException: org.apache.spark.SparkContext
>
> Is there a way we can fix this?
>
> Thanks
>



-- 
Ricardo Paiva
Big Data
*globo.com* <http://www.globo.com>





Re: Using JDBC clients with "Spark on Hive"

2016-01-18 Thread Ricardo Paiva
Are you running the Spark Thrift JDBC/ODBC server?

In my environment I have a Hive Metastore server and the Spark Thrift
Server pointing to the Hive Metastore.

I use the Hive beeline tool for testing. With this setup I'm able to connect
Tableau to Hive tables, using Spark SQL as the engine.
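
If you go the Spark Thrift Server route, any HiveServer2-compatible JDBC client
should work. A minimal sketch of a plain JDBC connection (host, port, user and
table are placeholders; it assumes the hive-jdbc driver is on the classpath):

import java.sql.DriverManager

// The Spark Thrift Server speaks the HiveServer2 protocol, so the standard
// Hive JDBC driver is used.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-server-host:10000", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM my_table")
while (rs.next()) {
  println(rs.getLong(1))
}
rs.close()
stmt.close()
conn.close()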

Regards,

Ricardo


On Thu, Jan 14, 2016 at 11:15 PM, sdevashis [via Apache Spark User List] <
ml-node+s1001560n25976...@n3.nabble.com> wrote:

> Hello Experts,
>
> I am getting started with Hive with Spark as the query engine. I built the
> package from sources. I am able to invoke the Hive CLI and run queries, and I
> see in Ambari that Spark applications are being created, confirming Hive is
> using Spark as the engine.
>
> However, other than the Hive CLI, I am not able to run queries from any other
> clients that use JDBC to connect to Hive through Thrift. I tried
> Squirrel, the Aginity Netezza workbench, and even Hue.
>
> No YARN applications are getting created, and the query times out after
> some time. Nothing gets into /tmp/user/hive.log. Am I missing something?
>
> Again, I am using Hive on Spark and not Spark SQL.
>
> Version Info:
> Spark 1.4.1 built for Hadoop 2.4
>
>
> Thank you in advance for any pointers.
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: Spark Streaming: BatchDuration and Processing time

2016-01-18 Thread Ricardo Paiva
If you are using Kafka as the message queue, Spark will still process the time
slices in order, even when it is running late, as in your example. But it will
eventually fail, because at some point your job will ask for a message that is
older than the oldest message still retained in Kafka.

If your processing takes longer than the batch interval at your system's peak
time during the day, but much less time at night when the system is mostly
idle, the streaming job will keep working and catch up correctly (though it's
risky if the late time slices don't finish during the idle period).

The best thing to do is to optimize your job so each batch fits within the
batch interval and the backlog never builds up. :)
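
With the direct Kafka stream, two settings can help keep each batch within the
batch interval; a sketch only (backpressure needs Spark 1.5+, and the
maxRatePerPartition value below is an arbitrary example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-streaming-app")
  // Let Spark adapt the ingestion rate to the observed processing time (Spark 1.5+).
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard cap on records read per Kafka partition per second (direct stream only);
  // 1000 is just an example value.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")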

Regards,

Ricardo





On Sun, Jan 17, 2016 at 2:32 PM, pyspark2555 [via Apache Spark User List] <
ml-node+s1001560n25986...@n3.nabble.com> wrote:

> Hi,
>
> If BatchDuration is set to 1 second in StreamingContext and the actual
> processing time is longer than one second, then how does Spark handle that?
>
> For example, I am receiving a continuous Input stream. Every 1 second
> (batch duration), the RDDs will be processed. What if this processing time
> is longer than 1 second? What happens in the next batch duration?
>
> Thanks.
> Amit
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: Spark streaming: Fixed time aggregation & handling driver failures

2016-01-18 Thread Ricardo Paiva
I don't know if this is the most efficient way to do it, but you can use
a sliding window that is bigger than your aggregation period and filter
only the messages that fall inside the period.

Remember that to work with reduceByKeyAndWindow you need to associate
each row with its time key, in your case "MMddhhmm".

http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
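
A rough sketch of the idea; the stream, the 5-minute bucket and the key format
here are hypothetical, so adapt them to your own batch and window sizes:

import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// `events` is a hypothetical DStream of (key, payload) pairs coming from Kafka.
def fixedWindowCounts(events: DStream[(String, String)]): DStream[(String, Long)] = {
  val fmt = new SimpleDateFormat("yyyyMMddHHmm")

  // Key each record by the fixed 5-minute bucket it belongs to.
  val bucketed = events.map { case (_, _) =>
    val now = System.currentTimeMillis()
    val bucketStart = now - (now % (5 * 60 * 1000L))
    (fmt.format(new Date(bucketStart)), 1L)
  }

  // Use a window larger than the bucket so every fixed bucket is fully covered;
  // downstream you keep only the buckets that are already complete.
  bucketed.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(10), Minutes(5))
}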

Hope it helps,

Regards,

Ricardo



On Sat, Jan 16, 2016 at 1:13 AM, ffarozan [via Apache Spark User List] <
ml-node+s1001560n25982...@n3.nabble.com> wrote:

> I am implementing aggregation using Spark Streaming and Kafka. My batch
> and window size are the same, and the aggregated data is persisted in
> Cassandra.
>
> I want to aggregate for fixed time windows - 5:00, 5:05, 5:10, ...
>
> But we cannot control when the streaming job runs; we only get to specify
> the batch interval.
>
> So the problem is: let's say the streaming job starts at 5:02; then I will
> get results at 5:07, 5:12, etc., which is not what I want.
>
> Any suggestions?
>
> thanks,
> Firdousi
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>





Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Ricardo Paiva
Thanks Cody.

You confirmed that I'm not doing something wrong. I will keep investigating,
and if I find something I'll let everyone know.

Thanks again.

Regards,

Ricardo

On Mon, Sep 14, 2015 at 6:29 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Yeah, looks like you're right about being unable to change those.  Upon
> further reading, even though StreamingContext.getOrCreate makes an entirely
> new spark conf, Checkpoint will only reload certain properties.
>
> I'm not sure if it'd be safe to include memory / cores among those
> properties that get re-loaded, TD would be a better person to ask.
>
> On Mon, Sep 14, 2015 at 2:54 PM, Ricardo Paiva <
> ricardo.pa...@corp.globo.com> wrote:
>
>> Hi Cody,
>>
>> Thanks for your answer.
>>
>> I had already tried to change the spark-submit parameters, but I double
>> checked before replying to your answer. Whether I change the properties file
>> or the spark-submit arguments directly, none of them take effect when the
>> application runs from the checkpoint. It seems that everything is cached. I
>> changed driver memory, executor memory, executor cores and number of executors.
>>
>> So, the scenario I have today is: once the Spark Streaming application
>> retrieves the data from the checkpoint, I can't change the submission
>> parameters or the code parameters without removing the checkpoint
>> folder, losing all the data used by windowed functions. I was wondering
>> what kind of parameters you guys load from the configuration file
>> when using checkpoints.
>>
>> I really appreciate all the help on this.
>>
>> Many thanks,
>>
>> Ricardo
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 11, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Yeah, it makes sense that parameters that are read only during your
>>> getOrCreate function wouldn't be re-read, since that function isn't called
>>> if a checkpoint is loaded.
>>>
>>> I would have thought changing number of executors and other things used
>>> by spark-submit would work on checkpoint restart.  Have you tried both
>>> changing them in the properties file provided to spark submit, and the
>>> --arguments that correspond to number of cores / executor memory?
>>>
>>> On Thu, Sep 10, 2015 at 5:23 PM, Ricardo Luis Silva Paiva <
>>> ricardo.pa...@corp.globo.com> wrote:
>>>
>>>>
>>>> Hi guys,
>>>>
>>>> I tried to use the configuration file, but it didn't work as I
>>>> expected. As part of the Spark Streaming flow, my methods run only when the
>>>> application is started the first time. Once I restart the app, it reads
>>>> from the checkpoint and all the dstream operations come from the cache. No
>>>> parameter is reloaded.
>>>>
>>>> I would like to know if it's possible to reset the time of windowed
>>>> operations, checkpoint time etc. I also would like to change the submission
>>>> parameters, like number of executors, memory per executor or driver etc. If
>>>> it's not possible, what kind of parameters do you guys usually use in a
>>>> configuration file? I know that the streaming interval cannot be
>>>> changed.
>>>>
>>>> This is my code:
>>>>
>>>> def main(args: Array[String]): Unit = {
>>>>   val ssc = StreamingContext.getOrCreate(CHECKPOINT_FOLDER,
>>>> createSparkContext _)
>>>>   ssc.start()
>>>>   ssc.awaitTermination()
>>>>   ssc.stop()
>>>> }
>>>>
>>>> def createSparkContext(): StreamingContext = {
>>>>   val sparkConf = new SparkConf()
>>>>  .setAppName(APP_NAME)
>>>>  .set("spark.streaming.unpersist", "true")
>>>>   val ssc = new StreamingContext(sparkConf, streamingInterval)
>>>>   ssc.checkpoint(CHECKPOINT_FOLDER)
>>>>   ssc.sparkContext.addFile(CONFIG_FILENAME)
>>>>
>>>>   val rawStream = createKafkaRDD(ssc)
>>>>   processAndSave(rawStream)
>>>>   return ssc
>>>> }
>>>>
>>>> def processAndSave(rawStream:DStream[(String, Array[Byte])]): Unit = {
>>>>
>>>>   val configFile = SparkFiles.get("config.properties")
>>>>   val config:Config = ConfigFactory.parseFile(new File(configFile))
>>>>
>>>>
>>>> *  slidingInterval =
>>>> Minutes(config.getI

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Ricardo Paiva
>>>> changing the jar you can't recover the
>>>> checkpoint.
>>>>
>>>> If you're just changing parameters, why not externalize those in a
>>>> configuration file so your jar doesn't change?  I tend to stick even my
>>>> app-specific parameters in an external spark config so everything is in one
>>>> place.
>>>>
>>>> On Wed, Sep 2, 2015 at 4:48 PM, Ricardo Luis Silva Paiva <
>>>> ricardo.pa...@corp.globo.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Is there a way to submit an app code change, keeping the checkpoint
>>>>> data or do I need to erase the checkpoint folder every time I re-submit
>>>>> the Spark app with a new jar?
>>>>>
>>>>> I have an app that counts pageviews streamed from Kafka and delivers a
>>>>> file every hour covering the past 24 hours. I'm using reduceByKeyAndWindow
>>>>> with the reduce and inverse functions set.
>>>>>
>>>>> I'm doing some code improvements and would like to keep the data from
>>>>> the past hours, so when I re-submit a code change, I would keep delivering
>>>>> the pageviews aggregation without needing to wait for 24 hours of new data.
>>>>> Sometimes I'm just changing the submission parameters, like number of
>>>>> executors, memory and cores.
>>>>>
>>>>> Many thanks,
>>>>>
>>>>> Ricardo
>>>>>
>>>>> --
>>>>> Ricardo Paiva
>>>>> Big Data / Semântica
>>>>> *globo.com* <http://www.globo.com>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ricardo Paiva
>>> Big Data / Semântica
>>> *globo.com* <http://www.globo.com>
>>>
>>
>>
>>
>> --
>> Ricardo Paiva
>> Big Data / Semântica
>> *globo.com* <http://www.globo.com>
>>
>
>


-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>