Hi Egor,
About the first problem, I think you are right; it makes sense.
About the second problem, I checked the available resources on port 8088 and it
shows 16 available cores. I start my job with 4 executors with 1 core each,
and 1 GB per executor. My job uses at most 50 MB of memory (just for a test).
Hi,
How do I set these parameters while launching spark-shell?
spark.shuffle.memoryFraction=0.5
and
spark.yarn.executor.memoryOverhead=1024
I tried passing them like this, but I am getting the error below:
spark-shell --master yarn --deploy-mode client --driver-memory 16G
--num-executors 500 executor-cores
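For reference, a hedged sketch of how such properties can be passed at launch
using spark-shell's --conf flags (the resource numbers below are placeholders,
not a recommendation):

spark-shell --master yarn --deploy-mode client \
  --driver-memory 16G --num-executors 500 --executor-cores 2 \
  --conf spark.shuffle.memoryFraction=0.5 \
  --conf spark.yarn.executor.memoryOverhead=1024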
Went ahead and opened
https://issues.apache.org/jira/browse/SPARK-19586
though I'd generally expect to just close it as fixed in 2.1.0 and roll on.
On Sat, Feb 11, 2017 at 5:01 PM, Everett Anderson wrote:
> On the plus side, looks like this may be fixed in 2.1.0:
>
> == Physical Plan ==
> *Has
You are right, you need that PR. I pinged the author, but otherwise it
would be great if someone could carry it over the finish line.
On Sat, Feb 11, 2017 at 4:19 PM, Jason White
wrote:
> I'd like to create a Dataset using some classes from Geotools to do some
> geospatial analysis. In particul
The original Uber authors provided this performance test result:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
This was for MinHash only, though, so it's not clear what the
scalability is for the other metric types.
The SignRandomProjectionLSH is not yet in
Hello,
I specified a StructType like this:
val mySchema = StructType(Array(StructField("f1", StringType,
true), StructField("f2", StringType, true)))
I have many ORC files stored in the HDFS location:
/user/hos/orc_files_test_together
These files use different schemas: some of them have only f
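For what it's worth, a hedged sketch of applying that schema while reading the
whole directory (assuming Spark 2.x; columns absent from a given file should
come back as null, though I haven't verified this across ORC/Spark versions):

import org.apache.spark.sql.types._
val mySchema = StructType(Array(
  StructField("f1", StringType, true),
  StructField("f2", StringType, true)))
val df = spark.read.schema(mySchema).orc("/user/hos/orc_files_test_together")
df.printSchema()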
I came across the same problem while I ran my code at "model.save(sc, path)"
Error info:
IllegalArgumentException: u"Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':"
My platform is Mac; I installed the Spark build prebuilt for Hadoop. Then I
integrated PySpark with Jupyter.
Anyone
About the second problem: as I understand it, this can happen in two cases: (1) one job
prevents the other one from getting resources for its executors, or (2) the
bottleneck is reading from disk, so you cannot really parallelize that. I
have no experience with the second case, but it's easy to verify the first one:
just look o
"But if i increase only executor-cores the finish time is the same". More
experienced ones can correct me, if I'm wrong, but as far as I understand
that: one partition processed by one spark task. Task is always running on
1 core and not parallelized among cores. So if you have 5 partitions and
you
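As a hedged illustration of that point (path and numbers are hypothetical):

val rdd = sc.textFile("/data/input")  // suppose this yields 5 partitions
// At most 5 tasks run concurrently, no matter how many cores you request,
// because each partition is handled by exactly one task on one core.
val wider = rdd.repartition(16)       // more partitions -> more tasks can run in parallel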
Dear All,
I am using spark-xml_2.10 to parse and extract some data from XML files.
I got the issue of getting null value whereas the XML file contains actually
values.
++-
How much memory have you allocated to the driver? The driver stores some state
for tracking the task, stage and job history that you can see in the Spark
console; it does take up a significant portion of the heap, anywhere from
200 MB to 1 GB, depending on your map-reduce steps.
Either way that is a good
My take on the 2-3 tasks per CPU core is that you want to ensure you are
utilizing the cores to the max, which means it will help you with scaling
and performance. The question would be why not 1 task per core? The reason
is that you can probably get a good handle on the average execution time
per
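A hedged sketch of that sizing rule with hypothetical numbers (4 executors x 4
cores = 16 cores total):

val df = spark.range(1000000L).toDF("id")  // stand-in for your real data
val totalCores = 4 * 4                     // hypothetical cluster size
val sized = df.repartition(totalCores * 3) // roughly 2-3 tasks per core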
Spark has more support for Scala; by that I mean more APIs are available
for Scala compared to Python or Java. Also, Scala code will be more concise
and easier to read. Java is very verbose.
On Thu, Feb 9, 2017 at 10:21 PM, Irving Duran
wrote:
> I would say Java, since it will be somewhat similar t
Nancy,
As your log output indicated, your executor exceeded its 11 GB memory limit.
While you might want to address the root cause/data volume as suggested by Jon,
you can do an immediate test by changing your command as follows:
spark-shell --master yarn --deploy-mode client --driver-memory 16G
--num-execut
Your vendor should use Parquet's internal compression rather than taking a Parquet
file and gzipping it.
> On 13 Feb 2017, at 18:48, Benjamin Kim wrote:
>
> We are receiving files from an outside vendor who creates a Parquet data file
> and Gzips it before delivery. Does anyone know how to Gunzip the
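For illustration, a hedged sketch of what writing with Parquet's built-in gzip
codec could look like on the vendor side (the DataFrame and output path are
placeholders):

val df = spark.range(10).toDF("id")  // stand-in for the vendor's data
df.write.option("compression", "gzip").parquet("/path/to/output")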
We are receiving files from an outside vendor who creates a Parquet data file
and Gzips it before delivery. Does anyone know how to Gunzip the file in Spark
and inject the Parquet data into a DataFrame? I thought using sc.textFile or
sc.wholeTextFiles would automatically Gunzip the file, but I’m
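One hedged workaround sketch: decompress the files to a scratch location with the
Hadoop FileSystem API first, then read them as Parquet (paths are hypothetical and
this does a single-threaded copy on the driver):

import java.util.zip.GZIPInputStream
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

val hconf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(hconf)
fs.globStatus(new Path("/data/vendor/*.parquet.gz")).foreach { st =>
  val in = new GZIPInputStream(fs.open(st.getPath))
  val out = fs.create(new Path("/tmp/unzipped/" + st.getPath.getName.stripSuffix(".gz")))
  IOUtils.copyBytes(in, out, hconf, true)  // copies and closes both streams
}
val df = spark.read.parquet("/tmp/unzipped")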
If you update the data, then you don't have the same DataFrame anymore. If
you don't do what Assaf did, caching and forcing evaluation of the
DataFrame before using it concurrently, then you'll still get
consistent and correct results, but not necessarily efficient results. If
the fully
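A hedged sketch of the cache-then-share pattern being discussed (the data and the
per-thread filters are placeholders):

import org.apache.spark.sql.functions.col
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val shared = spark.range(1000000L).toDF("id")  // stand-in for the real input
shared.cache()
shared.count()  // force evaluation so the cache is populated before threads use it

// Each thread only derives new DataFrames from the shared, cached one.
val futures = Seq(0L, 1L).map { r =>
  Future { shared.filter(col("id") % 2 === r).count() }
}
val counts = futures.map(Await.result(_, 10.minutes))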
How about having a thread that updates and caches a DataFrame in memory while
other threads are requesting this DataFrame; is that thread safe?
2017-02-13 9:02 GMT+01:00 Reynold Xin :
> Yes your use case should be fine. Multiple threads can transform the same
> data frame in parallel since they create
Spark has a zipWithIndex function for RDDs (
http://stackoverflow.com/a/26081548) that adds an index column right after
you create an RDD, and I believe it preserves order. Then you can sort it
by the index after the cache step.
I haven't tried this with a Dataframe but this answer seems promisin
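A hedged sketch of that approach at the RDD level (tiny toy data just to show the
mechanics):

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 4)
val indexed = rdd.zipWithIndex()   // (value, originalIndex)
val restored = indexed
  .coalesce(2)                     // the step that can otherwise scramble order
  .sortBy(_._2)                    // restore the original order by index
  .map(_._1)
restored.collect()                 // Array(a, b, c, d)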
Hi,
I think I don't understand well enough how to launch jobs.
I have one job which takes 60 seconds to finish. I run it with the following
command:
spark-submit --executor-cores 1 \
--executor-memory 1g \
--driver-memory 1g \
--master yarn \
--deploy-m
Setting Spark's memoryOverhead configuration variable is recommended in
your logs, and has helped me with these issues in the past. Search for
"memoryOverhead" here:
http://spark.apache.org/docs/latest/running-on-yarn.html
That said, you're running on a huge cluster as it is. If it's possible to
I have a rather heavyweight native shared library which, among other options, can also
be accessed via a Java/JNI wrapper JAR (the library itself is written in
C++). This library needs up to 1000 external files which in total can be
larger than 50 GB in size. At init time all this data needs to be read
Hadoop 2.8's s3a collects a lot more metrics here, most of which you can find on
HDP-2.5 if you can grab those JARs. Everything comes out as Hadoop JMX metrics,
also readable & aggregatable through a call to FileSystem.getStorageStatistics.
Measuring IO time isn't something picked up, because it's a
RDDs and DataFrames do not guarantee any specific ordering of data. They
are like tables in a SQL database. The only way to get a guaranteed
ordering of rows is to explicitly specify an orderBy() clause in your
statement. Any ordering you see otherwise is incidental.
On Mon, Feb 13, 2017 at 7:52
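A minimal sketch of that, assuming a trivial DataFrame with a column n:

val df = spark.range(3).toDF("n")
df.coalesce(2).orderBy("n").show()  // row order is guaranteed only because of orderBy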
Hi,
I found something that surprised me: I expected the order of the rows to be
preserved, so I suspect this might be a bug. The problem is illustrated by
the Python example below:
In [1]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.count()
df.coalesce(2).rdd.glo
Hi All,
I am getting the below error while I am trying to join 3 tables, which are in
ORC format in Hive, from 5 10 GB tables through HiveContext in Spark:
Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Hi Ben,
You can replace HDFS with a number of storage systems since Spark is
compatible with other storage like S3. This would allow you to scale your
compute nodes solely for the purpose of adding compute power and not disk
space. You can deploy Alluxio on your compute nodes to offset the
perform
IIUC, Spark doesn't strongly bind to HDFS; it uses a common FileSystem layer
which supports different FS implementations, and HDFS is just one option. You
could also use S3 as a backend FS; from Spark's point of view it is transparent
across different FS implementations.
On Sun, Feb 12, 2017 at 5:32 PM, ayan guh
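For illustration, a hedged sketch of pointing Spark at S3 through the s3a
connector (bucket, path and credential handling are placeholders, and the
hadoop-aws JAR must be on the classpath):

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
val df = spark.read.parquet("s3a://my-bucket/path/to/data")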
Yes your use case should be fine. Multiple threads can transform the same
data frame in parallel since they create different data frames.
On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf
wrote:
> Hi,
>
> I was wondering if dataframe is considered thread safe. I know the spark
> session and spar
To my understanding, all transformations are thread-safe because a DataFrame
is just a description of the calculation and it's immutable, so the case
above is all right. Just be careful with the actions.
On Sun, Feb 12, 2017 at 4:06 PM, Mendelson, Assaf
wrote:
> Hi,
>
> I was wondering if dataframe