Re: How can I remove the need for calling cache

2017-08-02 Thread jeff saremi
Thanks Vadim. Yes, this is a good option for us. Thanks.

Re: How can I remove the need for calling cache

2017-08-02 Thread Vadim Semenov
So if you just save an RDD to HDFS via 'saveAsSequenceFile', you would have to create a new RDD that reads that data; this way you'll avoid recomputing the RDD but may lose time on saving/loading. Exactly the same thing happens with 'checkpoint'; 'checkpoint' is just a convenient method that gives you
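
A minimal sketch of that save-and-re-read pattern, assuming a spark-shell session; the HDFS path, the pair element types, and the "expensive" map are illustrative assumptions, not from the thread:

    // Stand-in for an expensive pipeline; the (Int, Long) pairs keep the RDD sequence-file friendly.
    val expensive = sc.parallelize(1 to 1000000).map(x => (x, x.toLong * 2))

    // Materialize the result once on HDFS...
    expensive.saveAsSequenceFile("hdfs:///tmp/expensive-rdd")

    // ...then build a fresh RDD over the saved files, so later jobs read the data back
    // instead of recomputing the original lineage.
    val reloaded = sc.sequenceFile[Int, Long]("hdfs:///tmp/expensive-rdd")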

Re: How can I remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 03:00, Vadim Semenov wrote: > `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it just saves data to some destination. Yes, that's what I thought, so the statement "..otherwise saving it on a file will require

Re: How can I remove the need for calling cache

2017-08-02 Thread Vadim Semenov
`saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so it just saves data to some destination. `cache/persist` allow you to cache data and keep the DAG, so if an executor that holds the data goes down, Spark would still be able to recalculate missing partitions
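
A short, hedged illustration of the cache/persist side of that contrast, again assuming a spark-shell session; the input path is an illustrative assumption:

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs:///data/events")
    events.persist(StorageLevel.MEMORY_AND_DISK) // keeps the data *and* the lineage (DAG)

    events.count() // first action materializes the cache
    events.count() // served from the cache; partitions lost with an executor are rebuilt from the DAG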

Re: How can I remove the need for calling cache

2017-08-02 Thread Vadim Semenov
Also check the `RDD.checkpoint()` method https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1550 On Wed, Aug 2, 2017 at 8:46 PM, Vadim Semenov wrote: > I'm not sure that "checkpointed" means the same thing in that

Re: How can I remove the need for calling cache

2017-08-02 Thread Vadim Semenov
I'm not sure that "checkpointed" means the same thing in that sentence. You can run a simple test using `spark-shell`:

    sc.setCheckpointDir("/tmp/checkpoint")
    val rdd = sc.parallelize(1 to 10).map(x => {
      Thread.sleep(1000)
      x
    })
    rdd.checkpoint()
    rdd.foreach(println) // Will take 10 seconds
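
A hedged continuation of that test (not part of the truncated message above): once the first action has run and the checkpoint has been written, later actions should read the checkpointed data rather than re-running the slow map.

    // Assumption: run after the snippet above, in the same spark-shell session.
    rdd.foreach(println) // expected to return quickly, reading from /tmp/checkpoint
    rdd.count()          // likewise served from the checkpoint, not recomputed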

Re: How can I remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 01:05, jeff saremi wrote: > Vadim: This is from the Mastering Spark book: "It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it on a file will require recomputation." Is this really true? I had the

Re: How can I remove the need for calling cache

2017-08-02 Thread jeff saremi
Vadim: This is from the Mastering Spark book: "It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it on a file will require recomputation." To me that means checkpoint will not prevent the recomputation that I was hoping for

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Holden Karau
The memory overhead is based less on the total amount of data and more on what you end up doing with the data (e.g. if you're doing a lot of off-heap processing or using Python you need to increase it). Honestly most people find this number for their job "experimentally" (e.g. they try a few
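
For reference, a hedged sketch of how such an experimentally found number gets applied in yarn-client mode; the app name and values are illustrative assumptions, and in yarn-cluster mode the same key would instead be passed via --conf on spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("repartition-hive-job")
      .set("spark.executor.memory", "8g")
      // In MB for Spark 2.x; raise it until YARN stops killing containers.
      .set("spark.yarn.executor.memoryOverhead", "2048")
    val sc = new SparkContext(conf)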

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Ryan, Thank you for the reply. For 2 TB of data, what should be the value of spark.yarn.executor.memoryOverhead? With regards to this, I see an issue at https://issues.apache.org/jira/browse/SPARK-18787 , not sure whether it works or not on Spark 2.0.1! Can you elaborate more for

Spark Streaming: Async action scheduling inside foreachRDD

2017-08-02 Thread Andrii Biletskyi
Hi all, What is the correct way to schedule multiple jobs inside the foreachRDD method and, importantly, await on the result to ensure those jobs have completed successfully? E.g.:

    kafkaDStream.foreachRDD { rdd =>
      val rdd1 = rdd.map(...)
      val rdd2 = rdd1.map(...)
      val job1Future = Future {
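
One common pattern, as a hedged sketch rather than the thread's definitive answer: run each action in a Scala Future and block at the end of the batch so that failures surface before the batch is treated as complete. The transformations, output paths, and the assumption that kafkaDStream is a DStream[String] are illustrative:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    kafkaDStream.foreachRDD { (rdd, time) =>
      val rdd1 = rdd.map(_.toUpperCase)   // placeholder transformations
      val rdd2 = rdd1.map(s => (s, 1))

      // Each Future submits an independent Spark job.
      val job1 = Future { rdd1.saveAsTextFile(s"hdfs:///out/raw/${time.milliseconds}") }
      val job2 = Future { rdd2.reduceByKey(_ + _).count() }

      // Block until both jobs finish; an exception in either Future is rethrown here and fails the batch.
      Await.result(Future.sequence(Seq(job1, job2)), 10.minutes)
    }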

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Gourav Sengupta
Hi Steve, I have written a sincere note of apology to everyone in a separate email. I sincerely request your forgiveness in advance if anything sounds impolite in my emails. Let me start by thanking you. I know it looks like I formed all my opinions based on that

Projection Pushdown and Predicate Pushdown in Parquet for Nested Column

2017-08-02 Thread Patrick
Hi, I would like to know whether Spark has support for projection pushdown and predicate pushdown in Parquet for nested columns. I can see two JIRA tasks with PRs: https://issues.apache.org/jira/browse/SPARK-17636 https://issues.apache.org/jira/browse/SPARK-4502 If not, are we seeing these
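
One hedged way to check this empirically, assuming a spark-shell session and an illustrative Parquet file whose schema contains a nested user struct: select and filter on a nested field and inspect the ReadSchema and PushedFilters entries of the Parquet scan in the physical plan.

    import spark.implicits._

    val df = spark.read.parquet("hdfs:///data/events.parquet") // illustrative path and schema

    df.select("user.id")
      .filter($"user.id" > 100)
      .explain(true) // look at ReadSchema and PushedFilters on the FileScan parquet node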

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
Chetan, When you're writing to a partitioned table, you want to use a shuffle to avoid the situation where each task has to write to every partition. You can do that either by adding a repartition by your table's partition keys, or by adding an order by with the partition keys and then columns
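
A hedged sketch of both options, assuming df is the DataFrame being written, an illustrative table partitioned by a date column, and the same insertInto approach that appears later in this digest:

    import org.apache.spark.sql.functions.col

    // Option 1: hash-repartition by the partition key so each task writes to only a few partitions.
    df.repartition(col("date"))
      .write.mode("append")
      .insertInto("my_db.events_partitioned")

    // Option 2: a global sort by the partition keys (then other frequently used columns),
    // which also shuffles and additionally clusters rows within each output file.
    df.orderBy(col("date"), col("event_type"))
      .write.mode("append")
      .insertInto("my_db.events_partitioned")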

[Debug] Upgrade from 1.4.1 to 2.0.0 breaks job

2017-08-02 Thread Druhin Goel
Hi, I just upgraded our spark version from 1.4.1 to 2.0.0. Upon running a job (which worked perfectly with 1.4.1) to test it, I see that it fails with an assertion error: "assertion failed: copyAndReset must return a zero value copy". I saw a thread related to this on StackOverflow, however, I
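
For context, that assertion comes from the AccumulatorV2 API introduced in Spark 2.0, which requires copyAndReset() to return an accumulator whose isZero is true. A hedged sketch (not the original job's code) of a custom accumulator whose isZero, copy, and reset stay consistent, registered the 2.x way:

    import org.apache.spark.util.AccumulatorV2

    // Illustrative accumulator tracking a running maximum.
    class MaxAccumulator extends AccumulatorV2[Long, Long] {
      private var _max: Long = Long.MinValue // the "zero" state
      override def isZero: Boolean = _max == Long.MinValue
      override def copy(): MaxAccumulator = { val acc = new MaxAccumulator; acc._max = _max; acc }
      override def reset(): Unit = { _max = Long.MinValue }
      override def add(v: Long): Unit = { _max = math.max(_max, v) }
      override def merge(other: AccumulatorV2[Long, Long]): Unit = { _max = math.max(_max, other.value) }
      override def value: Long = _max
    }

    val maxAcc = new MaxAccumulator
    sc.register(maxAcc, "max") // Spark 2.x registration; the old sc.accumulator(...) API is deprecated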

UNSUBSCRIBE

2017-08-02 Thread Manikandan Vijayakumar
Please Unsubscribe me.

Re: UNSUBSCRIBE

2017-08-02 Thread Andi Levin
Writing to the list: user@spark.apache.org
Subscription address: user-subscr...@spark.apache.org
Digest subscription address: user-digest-subscr...@spark.apache.org
Unsubscription addresses: user-unsubscr...@spark.apache.org
Getting help with the list: user-h...@spark.apache.org
Feeds: Atom 1.0

UNSUBSCRIBE

2017-08-02 Thread DAS, SUTANU
Please Unsubscribe me.

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ravindra
Either increase overall executor memory if you have scope, or try to give more % to overhead memory from the default of .7. Read this for more details. On Wed, Aug 2, 2017 at 11:03 PM Chetan Khatri

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Frank Austin Nothaft
> The general idea of writing to the user group is that people who know should answer, and not those who do not know. I’d also add that if you’re going to write to the user group, you should be polite to people who try to answer your queries, even if you think they’re wrong. This is

Re: PySpark Streaming S3 checkpointing

2017-08-02 Thread Steve Loughran
On 2 Aug 2017, at 10:34, Riccardo Ferrari wrote: Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At a high level my python script goes like this: class StreamingJob(): def __init__(..): ...

Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Can anyone please guide me with the above issue? On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri wrote: > Hello Spark Users, I have an HBase table that I read and write to a Hive managed table, where I applied partitioning by a date column, which worked fine but it has

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Steve Loughran
On 2 Aug 2017, at 14:25, Gourav Sengupta wrote: Hi, I am definitely sure that at this point in time everyone who has kindly cared to respond to my query needs to go and check this link

Re: Running multiple spark jobs on yarn

2017-08-02 Thread Jörn Franke
And if the yarn queues are configured as such. On 2. Aug 2017, at 16:47, ayan guha wrote: > Each of your spark-submit invocations will create separate applications in YARN and run concurrently (if you have enough resources, that is) >> On Thu, Aug 3, 2017 at 12:42 AM, serkan

[PySpark] Multiple driver cores

2017-08-02 Thread Judit Planas
Hello, I recently came across the "--driver-cores" option when, for example, launching a PySpark shell. Provided that there are idle CPUs on the driver's node, what would be the benefit of having multiple driver cores? For example, will this accelerate the

Re: Running multiple spark jobs on yarn

2017-08-02 Thread ayan guha
Each of your spark-submit invocations will create a separate application in YARN and run concurrently (if you have enough resources, that is). On Thu, Aug 3, 2017 at 12:42 AM, serkan taş wrote: > Hi, Where should I start to be able to run multiple spark jobs concurrently on a 3 node

Running multiple spark jobs on yarn

2017-08-02 Thread serkan taş
Hi, Where should I start to be able to run multiple spark jobs concurrently on a 3 node spark cluster? Get Outlook for Android

UNSUBSCRIBE

2017-08-02 Thread Jnana Sagar
Please Unsubscribe me. -- regards Jnana Sagar

Reading spark-env.sh from configured directory

2017-08-02 Thread Lior Chaga
Hi, I have multiple spark deployments using mesos. I use spark.executor.uri to fetch the spark distribution to the executor node. Every time I upgrade spark, I download the default distribution and just add a custom spark-env.sh to its spark/conf folder. Furthermore, any change I want to do in

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Gourav Sengupta
Hi, I am definitely sure that at this point in time everyone who has kindly cared to respond to my query needs to go and check this link https://spark.apache.org/docs/2.2.0/spark-standalone.html#spark-standalone-mode . It does mention that a SPARK standalone cluster can have multiple machines

Re: Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
Hey Jörn, The "pending" was more something like a flag like myDf.hasCatalystWorkToDo() or myDf.isPendingActions(). Maybe an access to the DAG? I just did that: ordersDf = ordersDf.withColumn( "time_to_ship", datediff(ordersDf.col("ship_date"), ordersDf.col("order_date")));

Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Hello Spark Users, I have an HBase table that I read and write to a Hive managed table, where I applied partitioning by a date column. This worked fine, but it has generated a large number of files across almost 700 partitions, and I wanted to use repartitioning to reduce file I/O by reducing the number of files inside each

Re: Quick one on evaluation

2017-08-02 Thread Jörn Franke
I assume printSchema would not trigger an evaluation. show might partially trigger an evaluation (not all data is shown, only a certain number of rows by default). Keep in mind that even a count might not trigger evaluation of all rows (especially in the future) due to updates to the optimizer.
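
A hedged way to see this in a 2.x spark-shell; the side-effecting map is only there to make visible when computation actually happens:

    import spark.implicits._

    val df = spark.sparkContext
      .parallelize(1 to 5)
      .map { x => println(s"computing $x"); x } // prints when a partition is computed
      .toDF("n")

    df.printSchema() // schema comes from the encoder: no "computing ..." output expected
    df.show(2)       // an action, but typically evaluates only enough partitions for 2 rows
    df.count()       // a full evaluation over all rows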

Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
Hi Sparkians, I understand the lazy evaluation mechanism with transformations and actions. My question is simpler: 1) are show() and/or printSchema() actions? I would assume so... and optional question: 2) is there a way to know if there are transformations "pending"? Thanks! jg

Inserting content of df to partitioned hive table (parquet format)

2017-08-02 Thread Jens Johannsen
Hi All, I'm trying to insert the content of a dataframe into a partitioned parquet-formatted Hive table using df.write.mode(SaveMode.Append).insertInto(myTable) with hive.exec.dynamic.partition = 'true' and hive.exec.dynamic.partition.mode = 'nonstrict'. I keep getting an
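
For comparison, a hedged sketch of this kind of dynamic-partition insert; the database, table, and column names are illustrative assumptions:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder
      .appName("dynamic-partition-insert")
      .enableHiveSupport()
      .getOrCreate()

    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    // insertInto resolves columns by position, so the partition column(s) generally
    // need to come last to line up with the target table's layout.
    val df = spark.table("staging.events")   // illustrative source
    df.select("id", "payload", "event_date") // event_date assumed to be the partition column
      .write.mode(SaveMode.Append)
      .insertInto("warehouse.events_partitioned")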

Spark-twitter Streaming

2017-08-02 Thread deependra singh
Hello everyone, This is in reference to spark-twitter streaming. val stream = TwitterUtils.createStream(ssc, None) Can anybody tell me why this dstream is not a proper JSON object, as I am not able to parse it further? spark-version = 2.0.1 spark-api = scala streaming jar = org.apache.bahir
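
A hedged note on what the stream actually contains: with the Bahir connector, TwitterUtils.createStream returns a DStream of twitter4j.Status objects (already parsed), not raw JSON strings, so it is not expected to parse as JSON. Fields can be read directly from Status; the example below is illustrative and assumes ssc from the question:

    import org.apache.spark.streaming.twitter.TwitterUtils

    val stream = TwitterUtils.createStream(ssc, None) // DStream[twitter4j.Status]
    val texts = stream.map(status => (status.getUser.getScreenName, status.getText))
    texts.print()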

PySpark Streaming S3 checkpointing

2017-08-02 Thread Riccardo Ferrari
Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At a high level my python script goes like this: class StreamingJob(): def __init__(..): ... sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key',)