You can take a look at the code that Spark generates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug.codegenString
val spark: SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq("A","b","c").toDF("col")
data.write.par
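A minimal sketch of printing the generated code, assuming the imports above (the filter below is just an example expression):
```
import org.apache.spark.sql.execution.debug._
data.filter($"col" === "A").debugCodegen()
```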
is some proper Spark dev out there who is looking for a problem to
> solve.
>
> On Fri, Nov 8, 2019 at 2:24 PM Vadim Semenov
> wrote:
>>
>> Basically, the driver tracks partitions and sends it over to
>> executors, so what it's trying to do is to serialize and comp
Basically, the driver tracks partitions and sends that mapping over to the
executors, so what it's trying to do is serialize and compress the
map, but because it's so big, it goes over 2GiB, which is Java's limit
on the max size of byte arrays, so the whole thing fails.
The size of data doesn't matter here m
that size but... who knows). I will try it
> with your suggestions and see if it solves the problem.
>
> thanks,
> Jerry
>
> On Tue, Sep 17, 2019 at 4:34 PM Vadim Semenov wrote:
>
>> Pre-register your classes:
>>
>> ```
>> import com.esotericsoftwar
Pre-register your classes:
```
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("[[B")) // byte[][]
    // register the rest of your application classes here, e.g.
    // kryo.register(classOf[YourClass])
  }
}
```
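A hedged sketch of wiring the registrator in (use the fully-qualified name of wherever you put the class):
```
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")
```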
Try "spark.shuffle.io.numConnectionsPerPeer=10"
On Fri, Aug 30, 2019 at 10:22 AM Daniel Zhang wrote:
> Hi, All:
> We are testing the EMR and compare with our on-premise HDP solution. We
> use one application as the test:
> EMR (5.21.1) with Hadoop 2.8.5 + Spark 2.4.3 vs HDP (2.6.3) with Hadoop
>
This is what you're looking for:
Handle large corrupt shuffle blocks
https://issues.apache.org/jira/browse/SPARK-26089
So until 3.0 the only way I can think of is to reduce the size/split your
job into many
On Thu, Aug 15, 2019 at 4:47 PM Mikhail Pryakhin
wrote:
> Hello, Spark community!
>
> I
just set up a filter
[image: Screen Shot 2019-06-24 at 4.51.20 PM.png]
On Mon, Jun 24, 2019 at 4:46 PM Jeff Evans
wrote:
> There seem to be a lot of people trying to unsubscribe via the main
> address, rather than following the instructions from the welcome
> email. Of course, this is not all t
next spark summit
On Thu, Jun 13, 2019 at 3:58 AM Alex Dettinger
wrote:
> Follow up on the release date for Spark 3. Any guesstimate or rough
> estimation without commitment would be helpful :)
>
> Cheers,
> Alex
>
> On Mon, Jun 10, 2019 at 5:24 PM Alex Dettinger
> wrote:
>
>> Hi guys,
>>
>>
saving/checkpointing would be preferable in case of a big data set because:
- the RDD gets saved to HDFS and the DAG gets truncated so if some
partitions/executors fail it won't result in recomputing everything
- you don't use memory for caching therefore the JVM heap is going to be
smaller which
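A minimal sketch of that approach, assuming an HDFS checkpoint directory is available (the path is an example):
```
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
myrdd.checkpoint() // the lineage is truncated once the checkpoint is written
myrdd.count()      // an action is needed to actually materialize it
```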
I/We have seen this error before on 1.6 but ever since we upgraded to 2.1
two years ago we haven't seen it
On Tue, Mar 12, 2019 at 2:19 AM wangfei wrote:
> Hi all,
> Non-deterministic FAILED_TO_UNCOMPRESS(5) or ’Stream is corrupted’
> errors
> may occur during shuffle read, described as t
>
> 1) Is there any difference in terms performance when we use datasets over
> dataframes? Is it significant to choose 1 over other. I do realise there
> would be some overhead due case classes but how significant is that? Are
> there any other implications.
As long as you use the DataFrame func
Yeah, the filter gets in front of the select after analysis:
scala> b.where($"bar" === 20).explain(true)
== Parsed Logical Plan ==
'Filter ('bar = 20)
+- AnalysisBarrier
      +- Project [foo#6]
         +- Project [_1#3 AS foo#6, _2#4 AS bar#7]
            +- SerializeFromObject [assertnotnull(ass
isn't any way to inject "poison pill" into repartition call :(
>
> пн, 11 февр. 2019 г. в 21:19, Vadim Semenov :
>>
>> something like this
>>
>> import org.apache.spark.TaskContext
>> ds.map(r => {
>> val taskContext = TaskContext.get()
>>
something like this
import org.apache.spark.TaskContext
ds.map(r => {
  val taskContext = TaskContext.get()
  if (taskContext.partitionId == 1000) {
    throw new RuntimeException
  }
  r
})
On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak wrote:
>
> I need to crash task which does repartition.
>
Hey Conrad,
has it started happening recently?
We recently started having some sporadic problems with drivers on EMR
getting stuck; up until two weeks ago everything was fine.
We're trying to figure out with the EMR team where the issue is coming from.
On Tue, Nov 27, 2018 at 6:29 AM Conrad
You can use checkpointing: in this case Spark will write out the RDD to
whatever destination you specify, and the RDD can then be reused from the
checkpointed state, avoiding recomputation.
On Mon, Nov 19, 2018 at 7:51 AM Dipl.-Inf. Rico Bergmann <
i...@ricobergmann.de> wrote:
> Thanks for your advis
Here you go:
the umbrella ticket:
https://issues.apache.org/jira/browse/SPARK-24417
and the sun.misc.unsafe one
https://issues.apache.org/jira/browse/SPARK-24421
On Wed, Oct 24, 2018 at 8:08 PM kant kodali wrote:
>
> Hi All,
>
> Does Spark have a plan to move away from sun.misc.Unsafe to VarHandl
You have too many partitions, so when the driver tries to gather
the status of all map outputs and send it back to the executors, it chokes on
the size of the structure that needs to be GZipped, and since it's
bigger than 2GiB, it produces an OOM.
On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman wrote:
>
>
one of the spills becomes bigger than 2GiB and can't be loaded fully
(as arrays in Java can't have more than 2^31 - 1 values)
>
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:76)
You can try increasing the number of partitions, so sp
`coalesce` sets the number of partitions for the last stage, so you
have to use `repartition` instead which is going to introduce an extra
shuffle stage
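An illustrative sketch of the difference (`expensiveRdd`, `heavyWork` and `out` are hypothetical):
```
// `coalesce(1)` collapses the whole last stage into a single task
expensiveRdd.map(heavyWork).coalesce(1).saveAsTextFile(out)
// `repartition(1)` adds a shuffle, so `heavyWork` still runs with full parallelism
expensiveRdd.map(heavyWork).repartition(1).saveAsTextFile(out)
```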
On Wed, Aug 8, 2018 at 3:47 PM Koert Kuipers wrote:
>
> one small correction: lots of files leads to pressure on the spark driver
> program whe
That's the max size of a byte array in Java: the length is
defined as an integer, and in most JVMs arrays can't hold more than
Int.MaxValue - 8 elements. Another way to overcome this is to create multiple
broadcast variables
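A hedged sketch of the multiple-broadcast workaround; `bigArray` and `chunkSize` are placeholders:
```
// split the data so each serialized broadcast stays under the byte-array limit
val chunks = bigArray.grouped(chunkSize).toArray
val broadcasts = chunks.map(chunk => sc.broadcast(chunk))
// on the executors, broadcasts.map(_.value) gives the chunks back
```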
On Sunday, August 5, 2018, klrmowse wrote:
> i don't need m
object MyDatabseSingleton {
  @transient
  lazy val dbConn = DB.connect(…)
}

`transient` marks the variable to be excluded from serialization,
and `lazy` opens the connection only when it's needed and also makes
sure that the val is thread-safe.
http://fdahms.com/2015/10/14/scala-and-the-transi
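A hedged usage sketch (the `process` helper is hypothetical): since the val is `lazy` and `@transient`, the connection is opened once per executor JVM instead of being serialized with the closure.
```
rdd.mapPartitions { iter =>
  val conn = MyDatabseSingleton.dbConn // resolved on the executor
  iter.map(record => process(conn, record))
}
```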
`spark.worker.cleanup.enabled=true` doesn't work for YARN.
On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote:
>
> I am trying to do few (union + reduceByKey) operations on a hiearchical
> dataset in a iterative fashion in rdd. The first few loops run fine but on
> the subsequent loops, the operat
That usually happens when you have different types for a column across some
parquet files.
In this case, I think you have a column of `Long` type that got a file with
`Integer` type; I had to deal with a similar problem once.
You would have to cast it to Long yourself.
On Mon, Jul 9, 2018 at 2:53 PM Nir
Try doing `unpersist(blocking=true)`
On Mon, Jul 9, 2018 at 2:59 PM Jeffrey Charles
wrote:
>
> I'm persisting a dataframe in Zeppelin which has dynamic allocation enabled
> to get a sense of how much memory the dataframe takes up. After I note the
> size, I unpersist the dataframe. For some reas
You can't change the executor/driver cores/memory on the fly once
you've already started a Spark Context.
On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu wrote:
>
> We aren't using Oozie or similar, moreover, the end to end job shall be
> exactly the same, but the data will be extremely different (num
As with typical `JAVA_OPTS`, you need to pass them as a single parameter:
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:-ResizePLAB"
Also, you have an extra space in the parameter; there should be no space
after the colon symbol.
On Tue, Jul 3, 2018 at 3:01 AM Aakash Basu wrote:
>
> Hi,
>
> I used
Try using `TaskContext`:
import org.apache.spark.TaskContext
val partitionId = TaskContext.getPartitionId()
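An illustrative sketch of seeding per-partition randomness with the partition id (the per-row logic is an assumption):
```
import org.apache.spark.TaskContext

val seeded = df.rdd.mapPartitions { iter =>
  val rng = new scala.util.Random(TaskContext.getPartitionId())
  iter.map(row => (row, rng.nextLong()))
}
```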
On Mon, Jun 25, 2018 at 11:17 AM Lalwani, Jayesh
wrote:
>
> We are trying to add a column to a Dataframe with some data that is seeded by
> some random data. We want to be able to control
Is there a way to write rows directly into off-heap memory in the Tungsten
format bypassing creating objects?
I have a lot of rows, and right now I'm creating objects, and they get
encoded, but because of the number of rows, it creates significant pressure
on GC. I'd like to avoid creating objects
Yeah, it depends on what you want to do with that timeseries data. We at
Datadog process trillions of points daily using Spark, I cannot really go
about what exactly we do with the data, but just saying that Spark can
handle the volume, scale well and be fault-tolerant, albeit everything I
said com
Upon downsizing to 20 partitions some of your partitions become too big,
and I see that you're doing caching, so the executors try to write the big
partitions to disk, but fail because they exceed 2GiB
> Caused by: java.lang.IllegalArgumentException: Size exceeds
Integer.MAX_VALUE
at sun.nio.ch.FileChann
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala#L38-L47
It's called String Interpolation
See "Advanced Usage" here
https://docs.scala-lang.org/overviews/core/string-interpolation.html
On Fri, May 4, 2018 at 10:10 AM, Christopher Piggott
You need to pass the config before creating a session:
val conf = new SparkConf()
// All three methods below are equivalent
conf.set("spark.executor.extraJavaOptions", "-Dbasicauth=myuser:mypassword")
conf.set("spark.executorEnv.basicauth", "myuser:mypassword")
conf.setExecutorEnv("basicauth", "myuser:mypassword")
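And a sketch of applying it, since the settings only take effect when the session is first created:
```
val spark = org.apache.spark.sql.SparkSession.builder()
  .config(conf)
  .getOrCreate()
```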
`.collect` returns an Array, and arrays can't have more than Int.MaxValue
elements, and in most JVMs the limit is even lower: `Int.MaxValue - 8`.
So that puts an upper limit; however, you can just create an Array of Arrays,
and so on, which is basically limitless, albeit with some gymnastics.
You cannot dynamically change the number of cores per executor or cores
per task, but you can change the number of executors.
In one of my jobs I have something like this, so when I know that I don't
need more than 4 executors, I kill all other executors (assuming that they
don't hold any cached
This question should be directed to the `spark-jobserver` group:
https://github.com/spark-jobserver/spark-jobserver#contact
They also have a gitter chat.
Also include the errors you get when you ask them your
question.
On Wed, Mar 14, 2018 at 1:37 PM, sujeet jog wrote:
>
> Input
But overall, I think the original approach is not correct.
If you end up with a single file in the tens of GB, the approach probably
needs to be reworked.
I don't see why you can't just write multiple CSV files using Spark, and
then concatenate them without Spark
On Fri, Mar 9, 2018 at 10:02 AM, Vad
Thanks
> Deepak
>
> On Fri, Mar 9, 2018, 20:12 Vadim Semenov wrote:
>
>> because `coalesce` gets propagated further up in the DAG in the last
>> stage, so your last stage only has one task.
>>
>> You need to break your DAG so your expensive operations would be in
because `coalesce` gets propagated further up in the DAG in the last stage,
so your last stage only has one task.
You need to break your DAG so your expensive operations would be in a
previous stage before the stage with `.coalesce(1)`
On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
rezaul.ka.
You need to put randomness at the beginning of the key; if you put it
anywhere other than the beginning, you're not guaranteed to
get good performance.
The way we achieved this is by writing to HDFS first, and then having a
custom DistCp implemented using Spark that copies parquet f
Do you have a trace? i.e. what's the source of `io.netty.*` calls?
And have you tried bumping `-XX:MaxDirectMemorySize`?
On Tue, Mar 6, 2018 at 12:45 AM, Chawla,Sumit
wrote:
> Hi All
>
> I have a job which processes a large dataset. All items in the dataset
> are unrelated. To save on cluster
Something like this?
sparkSession.experimental.extraStrategies = Seq(Strategy)
val logicalPlan = df.logicalPlan
val newPlan: LogicalPlan = Strategy(logicalPlan)
Dataset.ofRows(sparkSession, newPlan)
On Thu, Mar 1, 2018 at 8:20 PM, Keith Chapman
wrote:
> Hi,
>
> I'd like to write a custom Spa
Yeah, without actually seeing what's happening on that line, it'd be
difficult to say for sure.
You can check what patches Hortonworks applied, and/or ask them.
And yeah, seg fault is totally possible on any size of the data. But you
should've seen it in the `stdout` (assuming that the regular lo
Who's your spark provider? EMR, Azure, Databricks, etc.? Maybe contact
them, since they've probably applied some patches
Also have you checked `stdout` for some Segfaults? I vaguely remember
getting `Task failed while writing rows at` and seeing some segfaults that
caused that
On Wed, Feb 28, 201
I'm sorry, didn't see `Caused by:
java.lang.NullPointerException at
org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)`
Are you sure that you use 2.2.0?
I don't see any code on that line
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sq
There should be another exception trace (basically, the actual cause) after
this one, could you post it?
On Wed, Feb 28, 2018 at 1:39 PM, unk1102 wrote:
> Hi I am getting the following exception when I try to write DataFrame using
> the following code. Please guide. I am using Spark 2.2.0.
>
> d
Am I missing to relate here.
>
> What I m thinking now is number of vote = number of threads.
>
>
>
> On Mon, 26 Feb 2018 at 18:45, Vadim Semenov wrote:
>
>> All used cores aren't getting reported correctly in EMR, and YARN itself
>> has no control over it, s
All used cores aren't getting reported correctly in EMR, and YARN itself
has no control over it, so whatever you put in `spark.executor.cores` will
be used,
but in the ResourceManager you will only see 1 vcore used per nodemanager.
On Mon, Feb 26, 2018 at 5:20 AM, Selvam Raman wrote:
> Hi,
>
> s
The other way might be to launch a single SparkContext and then run jobs
inside of it.
You can take a look at these projects:
-
https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
- http://livy.incubator.apache.org
Problems with this way:
Functions are still limited to 22 arguments
https://github.com/scala/scala/blob/2.13.x/src/library/scala/Function22.scala
On Tue, Dec 26, 2017 at 2:19 PM, Felix Cheung
wrote:
> Generally the 22 limitation is from Scala 2.10.
>
> In Scala 2.11, the issue with case class is fixed, but with that s
Ah, yes, I missed that part.
It's `spark.local.dir`:
spark.local.dir /tmp Directory to use for "scratch" space in Spark,
including map output files and RDDs that get stored on disk. This should be
on a fast, local disk in your system. It can also be a comma-separated list
of multiple directories on
at 10:08 AM, Mihai Iacob wrote:
> When does spark remove them?
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics
>
>
>
> - Original message -
> From: Vadim Semenov
> To: Mihai Iacob
>
Spark doesn't remove intermediate shuffle files if they're part of the same
job.
On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob wrote:
> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Anything wrong with the code below? or are there any
I think it means that we can replace HDFS with a blockchain-based FS, and
then offload some processing to smart contracts.
On Mon, Dec 18, 2017 at 11:59 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> I am looking for same answer too .. will wait for response from other
> people
>
>
getAs is defined as:
def getAs[T](i: Int): T = get(i).asInstanceOf[T]
and when you call toString you call Object.toString, which doesn't depend on
the type,
so the asInstanceOf[T] gets dropped by the compiler, i.e.
row.getAs[Int](0).toString -> row.get(0).toString
we can confirm that by writing a simple s
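For instance, a minimal sketch along those lines (not necessarily the original snippet, which is cut off):
```
import org.apache.spark.sql.Row

val row = Row(42)
// the cast is erased, so both calls end up invoking Object.toString on the same value
assert(row.getAs[Int](0).toString == row.get(0).toString) // "42"
```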
not possible, but you can add your own object in your project to Spark's
package, which would give you access to package-private methods
package org.apache.spark.sql
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.LogicalRDD
import
You can pass `JAVA_HOME` environment variable
`spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0`
On Wed, Nov 29, 2017 at 10:54 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> Hi,
>
> I am running cloudera version of spark2.1 and our cluster is on JDK1.7.
> For some of the librari
The error message seems self-explanatory; try to figure out what disk
quota you have for your user.
On Wed, Nov 22, 2017 at 8:23 AM, Chetan Khatri
wrote:
> Anybody reply on this ?
>
> On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>>
>> Hello Spark
Try:
Class.forName("[Lorg.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$SerializableFileStatus$SerializableBlockLocation;")
On Sun, Nov 19, 2017 at 3:24 PM, Angel Francisco Orta <
angel.francisco.o...@gmail.com> wrote:
> Hello, I'm with spark 2.1.0 with scala and I'm register
There's a lot of off-heap memory involved in decompressing Snappy and
compressing with ZLib.
Since you're running using `local[*]`, you process multiple tasks
simultaneously, so they all might consume memory.
I don't think that increasing the heap will help, since it looks like you're
hitting system memory l
It's actually quite simple to answer
> 1. Is Spark SQL and UDF, able to handle all the workloads?
Yes
> 2. What user interface did you provide for data scientist, data engineers
and analysts
Home-grown platform, EMR, Zeppelin
> What are the challenges in running concurrent queries, by many users
When you do `Dataset.rdd` you actually create a new job
here you can see what it does internally:
https://github.com/apache/spark/blob/master/sql/core/
src/main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala
wrote:
> Hi Weichen,
>
> Thank
that's probably better directed to AWS support
On Sun, Oct 8, 2017 at 9:54 PM, Tushar Sudake wrote:
> Hello everyone,
>
> I'm using 'r4.8xlarge' instances on EMR for my Spark Application.
> To each node, I'm attaching one 512 GB EBS volume.
>
> By logging in into nodes I tried verifying t
Try increasing the `spark.yarn.am.waitTime` parameter; by default it's set
to 100ms, which might not be enough in certain cases.
On Tue, Oct 10, 2017 at 7:02 AM, Debabrata Ghosh
wrote:
> Hi All,
> I am constantly hitting an error : "ApplicationMaster:
> SparkContext did not in
I usually check the list of Hive UDFs as Spark has implemented almost all
of them
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
Or/and check `org.apache.spark.sql.functions` directly:
https://github.com/apache/spark/blob/master/sql/core/src/mai
Since you're using Dataset API or RDD API, they won't be fused together by
the Catalyst optimizer unless you use the DF API.
Two filters will get executed within one stage, and there'll be very small
overhead on having two separate filters vs having only one.
On Tue, Oct 3, 2017 at 8:14 AM, Ahmed
les simultaneously.
See more about scheduling within one application:
https://spark.apache.org/docs/2.2.0/job-scheduling.html#
scheduling-within-an-application
On Fri, Sep 29, 2017 at 12:58 PM, Jeroen Miller
wrote:
> On Thu, Sep 28, 2017 at 11:55 PM, Jeroen Miller
> wrote:
> > On Thu,
As an alternative: checkpoint the dataframe, collect the days, and then delete
the corresponding directories using the Hadoop FileUtils, then write the dataframe
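A hedged sketch of that alternative; the output path, the `day` column type, and the append mode are assumptions taken from the question, and it uses the FileSystem API rather than FileUtils:
```
import org.apache.hadoop.fs.{FileSystem, Path}

// requires spark.sparkContext.setCheckpointDir(...) to have been called
val checkpointed = df.checkpoint() // cut the lineage so `df` isn't recomputed
val days = checkpointed.select("day").distinct().collect().map(_.getString(0))
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
days.foreach(day => fs.delete(new Path(s"dataset.parquet/day=$day"), true))
checkpointed.write.partitionBy("day").mode("append").save("dataset.parquet")
```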
On Fri, Sep 29, 2017 at 10:31 AM, peay wrote:
> Hello,
>
> I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet")
> to write
How many files do you produce? I believe it spends a lot of time on renaming
the files because of the output committer.
Also instead of 5x c3.2xlarge try using 2x c3.8xlarge instead because they
have 10GbE and you can get good throughput for S3.
On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <
alex
allelize(…).map(test => {
Model.get().use(…)
})
}
}
```
On Thu, Sep 28, 2017 at 3:49 PM, Vadim Semenov
wrote:
> as an alternative
> ```
> spark-submit --files
> ```
>
> the files will be put on each executor in the working directory, so you
> can then load it alongside your
as an alternative
```
spark-submit --files
```
the files will be put on each executor in the working directory, so you can
then load it alongside your `map` function
Behind the scene it uses `SparkContext.addFile` method that you can use too
https://github.com/apache/spark/blob/master/core/src/m
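A hedged sketch of reading such a file on the executor side (`model.bin` is a made-up file name):
```
// submitted with: spark-submit --files model.bin ...
import org.apache.spark.SparkFiles

val localPath = SparkFiles.get("model.bin")
val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(localPath))
```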
Looks like there's slowness in sending shuffle files; maybe one executor
gets overwhelmed with all the other executors trying to pull data?
Try lifting `spark.network.timeout` further, we ourselves had to push it to
600s from the default 120s
On Thu, Sep 28, 2017 at 10:19 AM, Ilya Karpov
wrote:
>
Instead of having one job, you can try processing each file in a separate
job, but run multiple jobs in parallel within one SparkContext.
Something like this should work for you: it'll submit N jobs from the
driver, the jobs will run independently, but executors will dynamically
work on different j
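A hedged sketch of that pattern, assuming `files` is a collection of input paths and `spark` is the shared session:
```
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// each Future submits an independent Spark job; they all share the same executors
val jobs = files.map { path =>
  Future {
    spark.read.parquet(path).count() // replace with the real per-file processing
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
```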
1. 40s is pretty negligible unless you run your job very frequently, there
can be many factors that influence that.
2. Try to compare the CPU time instead of the wall-clock time
3. Check the stages that got slower and compare the DAGs
4. Test with dynamic allocation disabled
On Fri, Sep 22, 201
This may also be related to
https://issues.apache.org/jira/browse/SPARK-22033
On Tue, Sep 19, 2017 at 3:40 PM, Mark Bittmann wrote:
> I've run into this before. The EigenValueDecomposition creates a Java
> Array with 2*k*n elements. The Java Array is indexed with a native integer
> type, so 2*k*
you can create a super class "FunSuiteWithSparkContext" that's going to
create a Spark session, Spark context, and SQLContext with all the desired
properties.
Then you add the class to all the relevant test suites, and that's pretty
much it.
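A hedged sketch of such a base class (the ScalaTest import path varies by version; all names are illustrative):
```
import org.scalatest.FunSuite
import org.apache.spark.sql.SparkSession

trait FunSuiteWithSparkContext extends FunSuite {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("tests")
    .getOrCreate()
  lazy val sc = spark.sparkContext
  lazy val sqlContext = spark.sqlContext
}
```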
The other option is to pass it as a VM parameter
back into the "Active" section showing a
> small number of complete/inprogress tasks.
>
> In my screenshot, Job #1 completed as normal, and then later on switched
> back to active with only 92 tasks... it never seems to change again, it's
> stuck in this frozen, active s
ething other than to achieve this?
>
>
>
> Alex
>
>
>
> *From: *Vadim Semenov
> *Date: *Monday, August 28, 2017 at 5:18 PM
> *To: *"Mikhailau, Alex"
> *Cc: *"user@spark.apache.org"
> *Subject: *Re: Referencing YARN application id, YARN cont
When you create an EMR cluster you can specify an S3 path where logs will be
saved after the cluster terminates, something like this:
s3://bucket/j-18ASDKLJLAKSD/containers/application_1494074597524_0001/container_1494074597524_0001_01_01/stderr.gz
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-man
I didn't tailor it to your needs, but this is what I can offer you, the
idea should be pretty clear
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}
val spark: SparkSession
import spark.implicits._
case class Input(
a: Int,
b: Long,
c:
Something like this, maybe?
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val df: DataFrame = ???
val spark = df.sparkSession
val rd
Scala doesn't support ranges >= Int.MaxValue
https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/immutable/Range.scala?utf8=✓#L89
You can create two RDDs and unionize them:
scala> val rdd = sc.parallelize(1L to
Int.MaxValue.toLong).union(sc.parallelize(1L to Int.MaxValue.toLong))
"checkpoint"ed data in the same job, you
need to delete it some other way.
BTW, have you tried '.persist(StorageLevel.DISK_ONLY)'? It caches data to
local disk, making more space in JVM and letting you to avoid hdfs.
On Wednesday, August 2, 2017, Vadim Semenov
wrote:
> `saveAsObj
`saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so it
just saves data to some destination.
`cache/persist` allow you to cache data and keep the DAG, so in case an
executor that holds data goes down, Spark would still be able to
recalculate the missing partitions.
`localCheckp
Also check the `RDD.checkpoint()` method
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1550
On Wed, Aug 2, 2017 at 8:46 PM, Vadim Semenov
wrote:
> I'm not sure that "checkpointed" means the same thing in that sentence.
in
> memory, otherwise saving it on a file will require recomputation."*
>
>
> To me that means checkpoint will not prevent the recomputation that i was
> hoping for
> --
> *From:* Vadim Semenov
> *Sent:* Tuesday, August 1, 2017 12:05:17 PM
> *
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint()
val result1 = myrdd.map(op1(_))
result1.count() // Will save `myrdd` to HDFS and do map(op1…
val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from HDFS
```
This should work:
```
ALTER TABLE `table` ADD PARTITION (partcol=1) LOCATION
'/path/to/your/dataset'
```
On Wed, Jul 19, 2017 at 6:13 PM, ctang wrote:
> I wonder if there are any easy ways (or APIs) to insert a dataframe (or
> DataSet), which does not contain the partition columns, as a static
>
You need to trigger an action on that rdd to checkpoint it.
```
scala>spark.sparkContext.setCheckpointDir(".")
scala>val df = spark.createDataFrame(List(("Scala", 35), ("Python",
30), ("R", 15), ("Java", 20)))
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.rdd.checkpoint()
scala> df.rdd.count() // an action is needed to actually write the checkpoint
```
Are you sure that you use S3A?
Because EMR says that they do not support S3A
https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/
> Amazon EMR does not currently support use of the Apache Hadoop S3A file
system.
I think that the HEAD requests come from the `createBucketIfNot
This is the code that chooses the partition for a key:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88
it's basically `math.abs(key.hashCode % numberOfPartitions)`
On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
vikash.par...@infoobjects
You can launch one permanent spark context and then execute your jobs
within the context. And since they'll be running in the same context, they
can share data easily.
These two projects provide the functionality that you need:
https://github.com/spark-jobserver/spark-jobserver#persistent-context-
It should be easy to start with a custom Hadoop InputFormat that reads the
file and creates an `RDD[Row]`; since you know the record size, it should
be pretty easy to make the InputFormat produce splits, so then you could
read the file in parallel.
On Mon, Jun 12, 2017 at 6:01 AM, OBones wrote
Have you tried running a query? something like:
```
test.select("*").limit(10).show()
```
On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук
wrote:
> Hi guys,
>
> I need to execute hive queries on remote hive server from spark, but for
> some reasons i receive only column names(without data).
> Dat
I believe it shows only the tasks that have actually been executed; if
there were tasks with no data, they don't get reported.
I might be mistaken; if somebody has a good explanation, I would also like to
hear it.
On Fri, May 19, 2017 at 5:45 PM, Miles Crawford wrote:
> Hey ya'll,
>
> Trying to mig
Use the official mailing list archive
http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3ccajyeq0gh1fbhbajb9gghognhqouogydba28lnn262hfzzgf...@mail.gmail.com%3e
On Thu, May 11, 2017 at 2:50 PM, lucas.g...@gmail.com
wrote:
> Also, and this is unrelated to the actual question... Why
You can provide your own log directory, where the Spark event log will be
saved, and which you can replay afterwards.
Set in your job this: `spark.eventLog.dir=s3://bucket/some/directory` and
run it.
Note! The path `s3://bucket/some/directory` must exist before you run your
job, it'll not be created automa
Check the source code for SparkLauncher:
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java#L541
a separate process will be started using `spark-submit` and if it uses
`yarn-cluster` mode, a driver may be launched on another NodeManager
You better ask folks in the spark-jobserver gitter channel:
https://github.com/spark-jobserver/spark-jobserver
On Wed, Dec 21, 2016 at 8:02 AM, Reza zade wrote:
> Hello
>
> I've extended the JavaSparkJob (job-server-0.6.2) and created an object
> of SQLContext class. my maven project doesn't hav