Solved:
Call spark-submit with
--driver-memory 512m --driver-java-options
"-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2
-Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2"
Thanks to:
https://issues.apache.org/jira/browse/SPARK-14367
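The same settings can also be passed as `--conf` options instead of driver system properties; a sketch of a full invocation (the application class and jar are placeholders, not from the thread):

```shell
# Placeholders: com.example.MyApp and my-app.jar are not from the thread.
spark-submit \
  --class com.example.MyApp \
  --driver-memory 512m \
  --conf spark.memory.useLegacyMode=true \
  --conf spark.shuffle.memoryFraction=0.2 \
  --conf spark.storage.memoryFraction=0.6 \
  --conf spark.storage.unrollFraction=0.2 \
  my-app.jar
```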
Hi,
I am trying to get the same memory behavior in Spark 1.6 as I had in Spark
1.3 with default settings.
I set
--driver-java-options "-Dspark.memory.useLegacyMode=true
-Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6
-Dspark.storage.unrollFraction=0.2"
in Spark 1.6.
But
to not use HDFS)
* Bonus question: Should I use a different API to get a better performance?
Thanks for any responses!
Tom Hubregtsen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/50-performance-decrease-when-using-local-file-vs-hdfs-tp23987.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
?
Thanks in advance,
Tom Hubregtsen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
metrics will someday be included in the Hadoop FileStatistics
API. In the meantime, it is not currently possible to understand how much of
a Spark task's time is spent reading from disk via HDFS.
That said, this might be worth posting as a footnote on the event timeline to
avoid confusion :)
Best regards,
Tom
I believe that since you are not persisting anything into the memory space
defined by
spark.storage.memoryFraction
you also have nothing to clear from this area with unpersist.
FYI: the data will be kept in the OS buffer cache/on disk at the point of the
reduce (as this involves a wide dependency -
is only available on pairRDDs, this might have something to do with it..)
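A hedged Scala sketch of that point (the input path and storage level are made up for illustration): unpersist only frees blocks that a persist actually put into the storage fraction.

```scala
import org.apache.spark.storage.StorageLevel

val pairs  = sc.textFile("input.txt").map(line => (line, 1)) // hypothetical input
val counts = pairs.reduceByKey(_ + _) // wide dependency: shuffle output goes to local disk/OS cache

counts.persist(StorageLevel.MEMORY_ONLY) // blocks now occupy the spark.storage.memoryFraction space
counts.count()                           // action: materializes the cached blocks
counts.unpersist()                       // frees them; without the persist() there is nothing to free
```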
I am using the spark master branch. The error:
[error]
/home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107:
value partitionBy is not a member of org.apache.spark.sql.DataFrame
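For what it's worth, partitionBy is defined on pair RDDs (PairRDDFunctions), not on DataFrame, which would explain the compile error. A hedged sketch of two possible workarounds (the key column and partition count are assumptions, not from the thread):

```scala
import org.apache.spark.HashPartitioner

// Workaround 1: drop to the pair-RDD API, where partitionBy exists.
val partitioned = df.rdd
  .map(row => (row.getString(0), row))   // assumes the sort key is the first column
  .partitionBy(new HashPartitioner(100)) // partition count chosen arbitrarily here

// Workaround 2 (recent branches): stay in the DataFrame API and
// repartition by a column instead.
val repartitioned = df.repartition(df("key")) // "key" is a placeholder column name
```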
Thanks,
Tom
I've looked a bit into what DataFrames are, and it seems that most posts on
the subject are related to SQL, but the API does seem to be very efficient. My
main question is: Are DataFrames also beneficial for non-SQL computations?
For instance I want to:
- sort k/v pairs (in particular, is the naive
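If it helps, sorting key/value pairs works in both APIs even without any SQL strings; a minimal Scala comparison (toy data, names invented):

```scala
// RDD API: sortByKey on a pair RDD.
val rdd       = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val sortedRdd = rdd.sortByKey() // ascending by key: (a,1), (b,2), (c,3)

// DataFrame API: the same sort expressed relationally, so Catalyst/Tungsten
// can plan it, even though no SQL query string is involved.
val df       = sqlContext.createDataFrame(Seq(("b", 2), ("a", 1), ("c", 3)))
  .toDF("key", "value")
val sortedDf = df.orderBy("key")
```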
Thanks for the responses.
Try removing toDebugString and see what happens.
The toDebugString is performed after [d] (the action), as [e]. By then all
stages are already executed.
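In other words, toDebugString only inspects the lineage and never runs a job, so it can equally be printed before the action; a sketch (the RDD and transformations are hypothetical):

```scala
val result = input.map(x => (x % 10, x)).reduceByKey(_ + _) // input: a hypothetical RDD[Int]
println(result.toDebugString) // prints the DAG/lineage; triggers no execution
result.count()                // the action that actually runs the stages
```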
]), and with a larger input set it can also take
a noticeable time. Does anybody have any idea what is running in this
Job/stage 0?
Thanks,
Tom Hubregtsen
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-executes-before-triggering-computation
I'm not sure, but I wonder whether, because you are using the Spark REPL, it
may not be representative of what a normal runtime execution would look like,
and it is possibly eagerly running a partial DAG once you define an operation
that would cause a shuffle.
What happens if you set up your same set of
a demand, but did not succeed in finding the actual
source code.
My question:
Could you guys please make the source code of the TeraSort program used,
preferably with its settings, available? If not, what are the reasons that
this seems to be withheld?
Thanks for any help,
Tom Hubregtsen
[1
Updated spark-defaults and spark-env:
Log directory /home/hduser/spark/spark-events does not exist.
(Also, in the default /tmp/spark-events it also did not work)
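Spark does not create the event-log directory for you; it has to exist (and actually be a directory) before the application starts. A sketch of the usual setup, reusing the path from the error message:

```shell
# Create the directory event logging expects (Spark will not create it).
mkdir -p /home/hduser/spark/spark-events

# Then point Spark at it in conf/spark-defaults.conf:
#   spark.eventLog.enabled  true
#   spark.eventLog.dir      file:///home/hduser/spark/spark-events
```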
On 30 March 2015 at 18:03, Marcelo Vanzin van...@cloudera.com wrote:
Are those config values in spark-defaults.conf? I don't think
?
(It always helps to show the command line you're actually running, and
if there's an exception, the first few frames of the stack trace.)
On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com
wrote:
Updated spark-defaults and spark-env:
Log directory /home/hduser/spark/spark-events
but the user could not navigate to it or (iii) it existed but
was not actually a directory.
So please double-check all that.
On Mon, Mar 30, 2015 at 5:11 PM, Tom Hubregtsen thubregt...@gmail.com
wrote:
Stack trace:
15/03/30 17:37:30 INFO storage.BlockManagerMaster: Registered
BlockManager
.pdf.
It is expected to scale sub-linearly; i.e., O(log N), where N is the
number of machines in your cluster.
We evaluated up to 100 machines, and it does follow O(log N) scaling.
--
Mosharaf Chowdhury
http://www.mosharaf.com/
On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen thubregt
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?
Thanks again,
Tom
On 11 March 2015 at