We use Algebird for calculating things like min/max, stddev, variance, etc.
https://github.com/twitter/algebird/wiki
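For example, a minimal sketch of the kind of one-pass aggregation we mean (the input values are hypothetical):

import com.twitter.algebird._

// hypothetical input
val xs = Seq(1.0, 2.0, 3.0, 4.0)

// Moments accumulates count/mean/variance in one pass and merges
// associatively, so it composes with reduce/aggregate across partitions
val m = xs.map(Moments(_)).reduce(MomentsGroup.plus(_, _))
println(s"mean=${m.mean} variance=${m.variance} stddev=${math.sqrt(m.variance)}")

// Min and Max combine the same way via their semigroups
val lo = xs.map(Min(_)).reduce(Semigroup.plus(_, _))
val hi = xs.map(Max(_)).reduce(Semigroup.plus(_, _))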
-Suren
On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
You should *never* use accumulators for this purpose because you may get
incorrect results: accumulator updates can be applied more than once when a
failed task is retried or a stage is recomputed.
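A sketch of the failure mode (names hypothetical):

val count = sc.accumulator(0)
val mapped = data.map { x => count += 1; x }
// without caching, each action recomputes the map, so the
// increments run again
mapped.count()
mapped.count() // count is now double the real row count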
Mohammed,
Jumping in for Daniel, we actually address the configuration issue by
pulling values from environment variables or command line options. That
approach may handle at least some of your needs.
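A minimal sketch of that pattern (all names are hypothetical):

// env var first, then a -D system property, then a default
val threshold: Int = sys.env.get("JOB_THRESHOLD")
  .orElse(sys.props.get("job.threshold"))
  .map(_.toInt)
  .getOrElse(100)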
For the akka issue, here is the akka version we include in build.sbt:
com.typesafe.akka %%
As Sungwook said, the classpath pointing to the mapr jar is the key for
that error.
MapR has a Spark install that hopefully makes it easier. I don't have the
instructions handy, but you can ask their support about it.
-Suren
On Wed, Oct 1, 2014 at 7:18 PM, Matei Zaharia
History Server is also very helpful.
On Thu, Jul 10, 2014 at 7:37 AM, Haopu Wang hw...@qilinsoft.com wrote:
I didn't keep the driver's log. That's a lesson learned.
I will try to run it again to see if it happens again.
--
*From:* Tathagata Das
Are there any gaps beyond convenience and code/config separation in using
spark-submit versus SparkConf/SparkContext if you are willing to set your
own config?
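For concreteness, a sketch of the programmatic route (master URL and settings are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// set everything in code instead of via spark-submit flags
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("my-job")
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)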
If there are any gaps, +1 on having parity within SparkConf/SparkContext
where possible. In my use case, we launch our jobs
I'll respond for Dan.
Our test dataset was a total of 10 GB of input data (full production
dataset for this particular dataflow would be 60 GB roughly).
I'm not sure what the size of the final output data was but I think it was
on the order of 20 GBs for the given 10 GB of input data. Also, I
Good luck.
Kevin
On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
I'll respond for Dan.
Our test dataset was a total of 10 GB of input data (full production
dataset for this particular dataflow would be 60 GB roughly).
I'm not sure what the size of the final output data
to 256GB+.
K
On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
To clarify, we are not persisting to disk. That was just one of the
experiments we did because of some issues we had along the way.
At this time, we are NOT using persist but cannot get the flow to
complete in Standalone
Also, our exact same flow but with 1 GB of input data completed fine.
-Suren
On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
How wide are the rows of data, either the raw input data or any generated
intermediate data?
We are at a loss as to why our flow
We're still relatively new with Spark (a few months), so would also love to
hear more from others in the community.
-Suren
On Tue, Jul 8, 2014 at 2:17 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Also, our exact same flow but with 1 GB of input data completed fine.
-Suren
robust with partitions of data that don't fit
in memory though. A lot of the work in the next few releases will be on
that.
On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
I'll respond for Dan.
Our test dataset was a total of 10 GB of input data (full
to sacrifice speed (if
the slowdown is not too big - I'm doing batch processing, nothing
real-time) for code simplicity and readability.
On Fri, Jul 4, 2014 at 3:16 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make sure
you are writing to multiple disks for best operation. And even with
DISK_ONLY, we've found that there is a minimum threshold for executor ram
(spark.executor.memory), which for us seemed to be around 8 GB.
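For example, a sketch of the relevant settings (paths are hypothetical):

import org.apache.spark.SparkConf

// spark.local.dir takes a comma-separated list, letting shuffle and
// DISK_ONLY blocks spread across physical disks
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
  .set("spark.executor.memory", "8g") // the rough floor we observed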
If you find
I've had some odd behavior with jobs showing up in the history server in
1.0.0. Failed jobs do show up but it seems they can show up minutes or
hours later. I see in the history server logs messages about bad task ids.
But then eventually the jobs show up.
This might be your situation.
One thing we ran into was that there was another log4j.properties earlier
in the classpath. For us, it was in our MapR/Hadoop conf.
If that is the case, something like the following could help you track it
down. The only thing to watch out for is that you might have to walk up the
classloader
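Something along these lines (a sketch; the original snippet was cut off):

// list every log4j.properties visible from each loader in the chain,
// to see which jar or conf dir wins on the classpath
var cl: ClassLoader = getClass.getClassLoader
while (cl != null) {
  val urls = cl.getResources("log4j.properties")
  while (urls.hasMoreElements) println(s"$cl -> ${urls.nextElement}")
  cl = cl.getParent
}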
I unfortunately haven't seen this directly. But some typical things I try
when debugging are as follows.
Do you see a corresponding error on the other side of that connection
(alpinenode7.alpinenow.local)? Or is that the same machine?
Also, do the driver logs show any longer stack trace and have
I can't speak for MLlib. But I can say the model of training in Hadoop
M/R or Spark and production scoring in Storm works very well. My team has
done online learning (Sofia ML library, I think) in Storm as well.
I would be interested in this answer as well.
-Suren
On Thu, Jun 19, 2014 at
-Suren
On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Looks like eventually there was some type of reset or timeout and the
tasks have been reassigned. I'm guessing they'll keep failing until the max
failure count is reached.
The machine it disconnected from
Out of curiosity - are you guys using speculation, shuffle
consolidation, or any other non-default option? If so that would help
narrow down what's causing this corruption.
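(For reference, those options as Spark 1.x config keys:)

import org.apache.spark.SparkConf

// the non-default options in question
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.shuffle.consolidateFiles", "true")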
On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Matt/Ryan,
Did you
I have a flow that ends with saveAsTextFile() to HDFS.
It seems all the expected files per partition have been written out, based
on the number of part files and the file sizes.
But the driver logs show 2 tasks still not completed, with no activity,
and the worker logs show no activity for
On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
I have a flow that ends with saveAsTextFile() to HDFS.
It seems all the expected files per partition have been written out, based
on the number of part files and the file sizes.
But the driver logs show 2
Vivek,
If the foldByKey solution doesn't work for you, my team uses
RDD.persist(DISK_ONLY) to avoid OOM errors.
It's slower, of course, and requires tuning other config parameters. It can
also be a problem if you do not have enough disk space, meaning that you
have to unpersist at the right time.
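A minimal sketch of the pattern (rdd and transform are hypothetical):

import org.apache.spark.storage.StorageLevel

// persist to disk only, run the downstream work, then free the space
val spilled = rdd.persist(StorageLevel.DISK_ONLY)
val result = spilled.map(transform).collect()
spilled.unpersist() // otherwise the on-disk blocks linger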
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0?
On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you
don’t mind the WARNING in the logs
you can set spark.executor.extraJavaOptions in your SparkConf object
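e.g. (a sketch; the GC flag is just an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")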
Best,
--
/06/10 18:51:14 INFO network.ConnectionManager: Removing
SendingConnection to ConnectionManagerId(172.16.25.125,45610)
On Wed, Jun 11, 2014 at 8:38 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
I have a somewhat large job (10 GB of input data, but it generates about
500 GB of data after
Event logs are different from logs written with a logger like log4j. The
event logs are the type of data that shows up in the history server.
For my team, we use com.typesafe.scalalogging.slf4j.Logging. Our logs show
up in /etc/spark/work/app-id/executor-id/stderr and stdout.
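A sketch of that usage (the class name is hypothetical):

import com.typesafe.scalalogging.slf4j.Logging

// mix in the trait; log output lands in the executor stderr/stdout
// files under the work directory, as noted above
class MyFlow extends Logging {
  def run(): Unit = logger.info("starting flow")
}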
All of our logging seems
I have a dataset of about 10GB. I am using persist(DISK_ONLY) to avoid out
of memory issues when running my job.
When I run with a dataset of about 1 GB, the job is able to complete.
But when I run with the larger dataset of 10 GB, I get the following
error/stacktrace, which seems to be
)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
On Mon, Jun 9, 2014 at 10:05 PM, Surendranauth Hiraman
suren.hira
, Surendranauth Hiraman
suren.hira...@velos.io wrote:
I don't know if this is related but a little earlier in stderr, I also
have the following stacktrace. But this stacktrace seems to be when the
code is grabbing RDD data from a remote node, which is different from the
above.
14/06/09 21:33:26
With respect to virtual hosts, my team uses Vagrant/Virtualbox. We have 3
CentOS VMs with 4 GB RAM each - 2 worker nodes and a master node.
Everything works fine, though if you are using MapR, you have to make sure
they are all on the same subnet.
-Suren
On Fri, May 30, 2014 at 12:20 PM,
My team is successfully running on Spark on MapR.
However, we add the MapR jars to the SPARK_CLASSPATH on the workers, as
well as making sure they are on the classpath of the driver.
I'm not sure if we need every jar that we currently add but below is what
we currently use. The important file in
We use the mapr rpm and have successfully read and written hdfs data.
Are you using custom readers/writers? Maybe the relevant stacktrace might
help.
Maybe also try a standard text reader and writer to see if there is a basic
issue with accessing mfs?
-Suren
On Mon, May 26, 2014 at 11:31 AM,
When I have stack traces, I usually see the MapR versions of the various
hadoop classes, though maybe that's at a deeper level of the stack trace.
If my memory is right though, this may point to the classpath having
regular hadoop jars before the MapR jars. My guess is that this
is on
If the purpose is only aliasing, rather than adding additional methods and
avoiding runtime allocation, what about type aliases?
type ID = String
type Name = String
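For contrast, the value-class version (a sketch):

// value classes give a compile-time distinction between ID and Name,
// which plain type aliases do not
class ID(val value: String) extends AnyVal
class Name(val value: String) extends AnyVal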
On Sat, Apr 19, 2014 at 9:26 PM, kamatsuoka ken...@gmail.com wrote:
No, you can wrap other types in value classes as well. You
Oh, sorry, I think your point was probably that you wouldn't need runtime
allocation.
I guess that is the key question. I would be interested if this works for
you.
-Suren
On Sun, Apr 20, 2014 at 9:18 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
If the purpose is only aliasing
Prashant,
In another email thread several weeks ago, it was mentioned that YARN
support is considered beta until Spark 1.0. Is that not the case?
-Suren
On Tue, Apr 15, 2014 at 8:38 AM, Prashant Sharma scrapco...@gmail.com wrote:
Hi Ishaaq,
answers inline from what I know; I'd like to be
in
http://spark.incubator.apache.org/docs/latest/configuration.html.
Matei
On Apr 11, 2014, at 7:02 AM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Matei,
Where is the functionality in 0.9 to spill data within a task (separately
from persist)? My apologies if this is something
Hi,
Any thoughts on this? Thanks.
-Suren
On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Hi,
I know if we call persist with the right options, we can have Spark
persist an RDD's data on disk.
I am wondering what happens in intermediate operations
Hi,
We have a situation where a PySpark script works fine as a local process
(local URL) on the Master and the Worker nodes, which would indicate that
all python dependencies are set up properly on each machine.
But when we try to run the script at the cluster level (using the master's
URL), if
trying to get a sense of how
the processing is handled behind the scenes with respect to disk.
2. When else is disk used internally?
Any pointers are appreciated.
-Suren
On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Hi,
Any thoughts on this? Thanks
Hi,
I know if we call persist with the right options, we can have Spark persist
an RDD's data on disk.
I am wondering what happens in intermediate operations that could
conceivably create large collections/Sequences, like GroupBy and shuffling.
Basically, one part of the question is when is
of locking/distributed locking is needed on the
individual Bloom Filter itself, with performance impact.
Agreed?
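One way to sidestep the locking entirely would be to build per-partition
filters and merge them monoidally, e.g. with Algebird's BloomFilterMonoid
(a sketch; rdd is a hypothetical RDD[String]):

import com.twitter.algebird._

// build a filter per element and merge associatively; no shared
// mutable filter, so no locking
val bfMonoid = BloomFilter(100000, 0.01) // expected entries, false-positive rate
val bf = rdd.map(bfMonoid.create(_)).reduce(bfMonoid.plus(_, _))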
-Suren
On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Mayur,
Thanks. This step is for creating the Bloom Filter, not using it to filter
data
Grouped by the group_id but not sorted.
-Suren
On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
You are using the data grouped (sorted?) to create the bloom filter?
On Mar 20, 2014 4:35 PM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Mayur