Re: Using spark.memory.useLegacyMode true does not yield expected behavior

2016-04-11 Thread Tom Hubregtsen
Solved: Call spark-submit with --driver-memory 512m --driver-java-options "-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" Thanks to: https://issues.apache.org/jira/browse/SPARK-14367
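
For context, here is a rough sketch of how the legacy fractions in the command above would translate into memory regions for a 512 MB heap. It assumes the legacy safety fractions (spark.storage.safetyFraction = 0.9 and spark.shuffle.safetyFraction = 0.8) are left at their defaults, which the post does not mention, and it treats the full 512 MB as usable heap, although the JVM typically reports somewhat less. The function name is hypothetical and the numbers are only indicative.

```python
def legacy_memory_regions(heap_mb,
                          storage_fraction=0.6,
                          shuffle_fraction=0.2,
                          unroll_fraction=0.2,
                          storage_safety=0.9,   # assumed default, not from the post
                          shuffle_safety=0.8):  # assumed default, not from the post
    """Estimate legacy-mode memory regions (in MB) for a given heap size."""
    # Storage region: heap * storage fraction * safety margin
    storage = heap_mb * storage_fraction * storage_safety
    # Shuffle (execution) region: heap * shuffle fraction * safety margin
    shuffle = heap_mb * shuffle_fraction * shuffle_safety
    # Unroll space is carved out of the storage region
    unroll = storage * unroll_fraction
    return {"storage_mb": storage, "shuffle_mb": shuffle, "unroll_mb": unroll}


regions = legacy_memory_regions(512)
for name, size in regions.items():
    print(f"{name}: {size:.1f}")
```

Comparing these estimates with the storage memory shown in the Spark UI is one way to check whether the legacy settings were actually picked up.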

Using spark.memory.useLegacyMode true does not yield expected behavior

2016-03-29 Thread Tom Hubregtsen
Hi, I am trying to get the same memory behavior in Spark 1.6 as I had in Spark 1.3 with default settings. I set --driver-java-options "--Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" in Spark 1.6. But

50% performance decrease when using local file vs hdfs

2015-07-24 Thread Tom Hubregtsen
to not use HDFS) * Bonus question: Should I use a different API to get better performance? Thanks for any responses! Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/50-performance-decrease-when-using-local-file-vs-hdfs-tp23987.html

Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
? Thanks in advance, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
metrics will someday be included in the Hadoop FileStatistics API. In the meantime, it is not currently possible to understand how much of a Spark task's time is spent reading from disk via HDFS. That said, this might be posted as a footnote at the event timeline to avoid confusion :) Best regards, Tom

Re: Un-persist RDD in a loop

2015-06-23 Thread Tom Hubregtsen
I believe that, as you are not persisting anything into the memory space defined by spark.storage.memoryFraction, you also have nothing to clear from this area using unpersist. FYI: The data will be kept in the OS-buffer/on disk at the point of the reduce (as this involves a wide dependency -

PartitionBy/Partitioner for dataFrames?

2015-06-21 Thread Tom Hubregtsen
is only available on pair RDDs, this might have something to do with it..) I am using the spark master branch. The error: [error] /home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107: value partitionBy is not a member of org.apache.spark.sql.DataFrame Thanks, Tom

DataFrames for non-SQL computation?

2015-06-11 Thread Tom Hubregtsen
I've looked a bit into what DataFrames are, and it seems that most posts on the subject are related to SQL, but it does seem to be very efficient. My main question is: Are DataFrames also beneficial for non-SQL computations? For instance I want to: - sort k/v pairs (in particular, is the naive

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Thanks for the responses. Try removing toDebugString and see what happens. The toDebugString is performed after [d] (the action), as [e]. By then all stages are already executed.

Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
]), and with larger input set can also take a noticeable time. Does anybody have any idea what is running in this Job/stage 0? Thanks, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-executes-before-triggering-computation

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
I'm not sure, but I wonder whether, because you are using the Spark REPL, it may not represent what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you set up your same set of

Re: Spark TeraSort source request

2015-04-13 Thread Tom Hubregtsen
a demand, but did not succeed in finding the actual source code. My question: Could you guys please make the source code of the used TeraSort program, preferably with settings, available? If not, what are the reasons that this seems to be withheld? Thanks for any help, Tom Hubregtsen [1

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
Updated spark-defaults and spark-env: Log directory /home/hduser/spark/spark-events does not exist. (It also did not work with the default /tmp/spark-events.) On 30 March 2015 at 18:03, Marcelo Vanzin van...@cloudera.com wrote: Are those config values in spark-defaults.conf? I don't think

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
? (It always helps to show the command line you're actually running, and if there's an exception, the first few frames of the stack trace.) On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Updated spark-defaults and spark-env: Log directory /home/hduser/spark/spark-events

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
but the user could not navigate to it or (iii) it existed but was not actually a directory. So please double-check all that. On Mon, Mar 30, 2015 at 5:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Stack trace: 15/03/30 17:37:30 INFO storage.BlockManagerMaster: Registered BlockManager

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
.pdf. It is expected to scale sub-linearly; i.e., O(log N), where N is the number of machines in your cluster. We evaluated up to 100 machines, and it does follow O(log N) scaling. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen thubregt
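
To give a feel for the O(log N) claim above, here is a minimal sketch of an idealized doubling model: in each round, every machine that already holds the data passes it to one machine that does not. This is only an illustration under that simplifying assumption, not a description of Spark's actual broadcast implementation, and the function name is hypothetical.

```python
def broadcast_rounds(num_machines):
    """Rounds needed to reach num_machines workers in an idealized doubling model."""
    holders = 1  # initially only the driver holds the data
    rounds = 0
    # Each round, every current holder sends the data to one new machine,
    # so the number of holders doubles until all workers are covered.
    while holders < num_machines + 1:
        holders *= 2
        rounds += 1
    return rounds


for n in (1, 10, 100, 1000):
    print(n, broadcast_rounds(n))
```

In this model, going from 100 to 1000 machines adds only a few rounds, which is the kind of growth a logarithmic bound describes.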

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks Mosharaf, for the quick response! Can you maybe give me some pointers to an explanation of this strategy? Or elaborate a bit more on it? Which parts are involved in which way? Where are the time penalties and how scalable is this implementation? Thanks again, Tom On 11 March 2015 at