Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Surendranauth Hiraman
We use Algebird for calculating things like min/max, stddev, variance, etc. https://github.com/twitter/algebird/wiki -Suren On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: You should *never* use accumulators for this purpose because you may get incorrect
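
A minimal sketch of the Algebird approach mentioned above, assuming Spark's Scala API and algebird-core on the classpath (the function and RDD names are illustrative, not from the thread):

    import com.twitter.algebird.{Moments, MomentsGroup}
    import org.apache.spark.rdd.RDD

    // Lift each value into the Moments monoid, then merge associatively;
    // associativity is what makes this safe across partitions, unlike an
    // accumulator-based approach.
    def summarize(values: RDD[Double]): Moments =
      values.map(Moments(_)).reduce(MomentsGroup.plus(_, _))

    // val m = summarize(rdd); then m.mean, m.variance, m.stddev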

Re: Play framework

2014-10-16 Thread Surendranauth Hiraman
Mohammed, Jumping in for Daniel, we actually address the configuration issue by pulling values from environment variables or command-line options. Maybe that can handle at least some of your needs. For the akka issue, here is the akka version we include in build.sbt: com.typesafe.akka %%
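
A hypothetical sketch of the environment-variable approach described above (all names are illustrative, not from the original message):

    import org.apache.spark.SparkConf

    // Pull deploy-specific values from the environment so the same jar runs
    // everywhere; fall back to sensible local-dev defaults.
    val masterUrl = sys.env.getOrElse("SPARK_MASTER_URL", "local[*]")
    val conf = new SparkConf()
      .setMaster(masterUrl)
      .setAppName(sys.env.getOrElse("APP_NAME", "play-spark-app"))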

Re: Spark And Mapr

2014-10-01 Thread Surendranauth Hiraman
As Sungwook said, the classpath pointing to the mapr jar is the key for that error. MapR has a Spark install that hopefully makes it easier. I don't have the instructions handy but you can ask their support about it. -Suren On Wed, Oct 1, 2014 at 7:18 PM, Matei Zaharia

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Surendranauth Hiraman
History Server is also very helpful. On Thu, Jul 10, 2014 at 7:37 AM, Haopu Wang hw...@qilinsoft.com wrote: I didn't keep the driver's log. It's a lesson. I will try to run it again to see if it happens again. -- *From:* Tathagata Das

Re: Purpose of spark-submit?

2014-07-09 Thread Surendranauth Hiraman
Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs
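
A sketch of the programmatic launch path being asked about, setting directly on SparkConf what spark-submit flags would otherwise supply (values illustrative, Spark 1.x API):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")    // what --master would supply
      .setAppName("programmatic-launch")
      .set("spark.executor.memory", "8g")  // what --executor-memory would supply
    val sc = new SparkContext(conf)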

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
I'll respond for Dan. Our test dataset was a total of 10 GB of input data (full production dataset for this particular dataflow would be 60 GB roughly). I'm not sure what the size of the final output data was but I think it was on the order of 20 GBs for the given 10 GB of input data. Also, I

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
. Good luck. Kevin On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote: I'll respond for Dan. Our test dataset was a total of 10 GB of input data (full production dataset for this particular dataflow would be 60 GB roughly). I'm not sure what the size of the final output data

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
to 256GB+. K On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote: To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Also, our exact same flow but with 1 GB of input data completed fine. -Suren On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
. We're still relatively new with Spark (a few months), so would also love to hear more from others in the community. -Suren On Tue, Jul 8, 2014 at 2:17 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Also, our exact same flow but with 1 GB of input data completed fine. -Suren

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
robust with partitions of data that don't fit in memory though. A lot of the work in the next few releases will be on that. On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I'll respond for Dan. Our test dataset was a total of 10 GB of input data (full

Re: Spark memory optimization

2014-07-07 Thread Surendranauth Hiraman
to sacrifice speed (if the slowdown is not too big - I'm doing batch processing, nothing real-time) for code simplicity and readability. On Fri, Jul 4, 2014 at 3:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: When using DISK_ONLY, keep in mind that disk I/O is pretty high

Re: Spark memory optimization

2014-07-04 Thread Surendranauth Hiraman
When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make sure you are writing to multiple disks for best operation. And even with DISK_ONLY, we've found that there is a minimum threshold for executor ram (spark.executor.memory), which for us seemed to be around 8 GB. If you find
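
A sketch of the setup described above (Spark 1.x API); the 8 GB figure is this thread's empirical floor, not a documented requirement, and the paths are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")                   // empirical minimum per this thread
      .set("spark.local.dir", "/disk1/spark,/disk2/spark")  // spread spill I/O over multiple disks
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///input/path")            // hypothetical input
    data.persist(StorageLevel.DISK_ONLY)                    // keep partitions on disk, not in RAM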

Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-03 Thread Surendranauth Hiraman
I've had some odd behavior with jobs showing up in the history server in 1.0.0. Failed jobs do show up but it seems they can show up minutes or hours later. I see in the history server logs messages about bad task ids. But then eventually the jobs show up. This might be your situation.

Re: Changing log level of spark

2014-07-01 Thread Surendranauth Hiraman
One thing we ran into was that there was another log4j.properties earlier in the classpath. For us, it was in our MapR/Hadoop conf. If that is the case, something like the following could help you track it down. The only thing to watch out for is that you might have to walk up the classloader
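
The snippet itself is truncated above; a hedged reconstruction of that kind of lookup, printing every log4j.properties visible while walking up the classloader hierarchy:

    // Prints each classloader and the log4j.properties URLs it can see, so
    // you can spot which copy wins on the classpath.
    def findOnClasspath(name: String): Unit = {
      var cl: ClassLoader = getClass.getClassLoader
      while (cl != null) {
        val urls = cl.getResources(name)    // java.util.Enumeration[java.net.URL]
        while (urls.hasMoreElements)
          println(s"$cl -> ${urls.nextElement}")
        cl = cl.getParent                   // walk up, as the message suggests
      }
    }

    findOnClasspath("log4j.properties")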

Re: Spark executor error

2014-06-26 Thread Surendranauth Hiraman
I unfortunately haven't seen this directly. But some typical things I try when debugging are as follows. Do you see a corresponding error on the other side of that connection (alpinenode7.alpinenow.local)? Or is that the same machine? Also, do the driver logs show any longer stack trace and have

Re: MLLib inside Storm : silly or not ?

2014-06-19 Thread Surendranauth Hiraman
I can't speak for MLlib either. But I can say the model of training in Hadoop M/R or Spark and production scoring in Storm works very well. My team has done online learning (Sofia ML library, I think) in Storm as well. I would be interested in this answer as well. -Suren On Thu, Jun 19, 2014 at

Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Surendranauth Hiraman
-Suren On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Looks like eventually there was some type of reset or timeout and the tasks have been reassigned. I'm guessing they'll keep failing until max failure count. The machine it disconnected from

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
wrote: Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you

Trailing Tasks Saving to HDFS

2014-06-18 Thread Surendranauth Hiraman
I have a flow that ends with saveAsTextFile() to HDFS. It seems all the expected files per partition have been written out, based on the number of part files and the file sizes. But the driver logs show 2 tasks still not completed, with no activity, and the worker logs show no activity for

Re: Trailing Tasks Saving to HDFS

2014-06-18 Thread Surendranauth Hiraman
On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: I have a flow that ends with saveAsTextFile() to HDFS. It seems all the expected files per partition have been written out, based on the number of part files and the file sizes. But the driver logs show 2

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Surendranauth Hiraman
Vivek, If the foldByKey solution doesn't work for you, my team uses RDD.persist(DISK_ONLY) to avoid OOM errors. It's slower, of course, and requires tuning other config parameters. It can also be a problem if you do not have enough disk space, meaning that you have to unpersist at the right
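
A sketch of the foldByKey alternative mentioned above, aggregating per key without materializing whole groups in memory (the sum is illustrative; substitute whatever combine your job needs):

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
    import org.apache.spark.rdd.RDD

    // foldByKey combines map-side, so no per-key Iterable is ever built,
    // which is what makes groupByKey OOM-prone by comparison.
    def totals(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
      pairs.foldByKey(0)(_ + _)

    // vs. the OOM-prone form: pairs.groupByKey().mapValues(_.sum)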

Re: long GC pause during file.cache()

2014-06-15 Thread Surendranauth Hiraman
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote: SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don't mind the WARNING in the logs. You can set spark.executor.extraJavaOpts in your SparkConf obj. Best, --
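
A sketch of the SparkConf route suggested above; note the full property name in Spark 1.0 is spark.executor.extraJavaOptions (the GC flags here are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails")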

Re: Error During ReceivingConnection

2014-06-11 Thread Surendranauth Hiraman
/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.125,45610) On Wed, Jun 11, 2014 at 8:38 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I have a somewhat large job (10 GB input data but generates about 500 GB of data after

Re: Spark Logging

2014-06-10 Thread Surendranauth Hiraman
Event logs are different from writing using a logger, like log4j. The event logs are the type of data showing up in the history server. For my team, we use com.typesafe.scalalogging.slf4j.Logging. Our logs show up in /etc/spark/work/app-id/executor-id/stderr and stdout. All of our logging seems
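
A minimal sketch of the logger setup mentioned above (scalalogging-slf4j 1.x, contemporary with Spark 1.0; the object name is illustrative):

    import com.typesafe.scalalogging.slf4j.Logging

    // The Logging trait supplies a `logger`; output lands in the executor's
    // stderr/stdout under the app's work directory, as noted above.
    object MyFlow extends Logging {
      def run(): Unit = logger.info("processing started")
    }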

FileNotFoundException when using persist(DISK_ONLY)

2014-06-09 Thread Surendranauth Hiraman
I have a dataset of about 10GB. I am using persist(DISK_ONLY) to avoid out of memory issues when running my job. When I run with a dataset of about 1 GB, the job is able to complete. But when I run with the larger dataset of 10 GB, I get the following error/stacktrace, which seems to be

Re: FileNotFoundException when using persist(DISK_ONLY)

2014-06-09 Thread Surendranauth Hiraman
) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) at org.apache.spark.rdd.RDD.iterator(RDD.scala:227) On Mon, Jun 9, 2014 at 10:05 PM, Surendranauth Hiraman suren.hira

Re: FileNotFoundException when using persist(DISK_ONLY)

2014-06-09 Thread Surendranauth Hiraman
, Surendranauth Hiraman suren.hira...@velos.io wrote: I don't know if this is related but a little earlier in stderr, I also have the following stacktrace. But this stacktrace seems to be when the code is grabbing RDD data from a remote node, which is different from the above. 14/06/09 21:33:26

Re: Spark 1.0.0 - Java 8

2014-05-30 Thread Surendranauth Hiraman
With respect to virtual hosts, my team uses Vagrant/Virtualbox. We have 3 CentOS VMs with 4 GB RAM each - 2 worker nodes and a master node. Everything works fine, though if you are using MapR, you have to make sure they are all on the same subnet. -Suren On Fri, May 30, 2014 at 12:20 PM,

Re: maprfs and spark libraries

2014-05-26 Thread Surendranauth Hiraman
My team is successfully running Spark on MapR. However, we add the mapr jars to the SPARK_CLASSPATH on the workers, as well as making sure they are on the classpath of the driver. I'm not sure if we need every jar that we currently add but below is what we currently use. The important file in
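
The jar list itself is truncated above; a hedged per-job equivalent of SPARK_CLASSPATH using Spark 1.x config properties (the MapR jar path is illustrative and varies by release):

    import org.apache.spark.SparkConf

    // Same effect as SPARK_CLASSPATH in spark-env.sh, but scoped to one job;
    // the jar name below is hypothetical.
    val conf = new SparkConf()
      .set("spark.executor.extraClassPath", "/opt/mapr/lib/maprfs.jar")
      .set("spark.driver.extraClassPath", "/opt/mapr/lib/maprfs.jar")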

Re: maprfs and spark libraries

2014-05-26 Thread Surendranauth Hiraman
We use the mapr rpm and have successfully read and written hdfs data. Are you using custom readers/writers? Maybe the relevant stacktrace might help. Maybe also try a standard text reader and writer to see if there is a basic issue with accessing mfs? -Suren On Mon, May 26, 2014 at 11:31 AM,

Re: maprfs and spark libraries

2014-05-26 Thread Surendranauth Hiraman
When I have stack traces, I usually see the MapR versions of the various hadoop classes, though maybe that's at a deeper level of the stack trace. If my memory is right though, this may point to the classpath having regular hadoop jars before the MapR jars. My guess is that this is on

Re: Anyone using value classes in RDDs?

2014-04-20 Thread Surendranauth Hiraman
If the purpose is only aliasing, rather than adding additional methods and avoiding runtime allocation, what about type aliases? type ID = String; type Name = String On Sat, Apr 19, 2014 at 9:26 PM, kamatsuoka ken...@gmail.com wrote: No, you can wrap other types in value classes as well. You
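
A sketch contrasting the two approaches in this thread; note that a value class can still box in generic contexts such as RDD[Name], which is the allocation question raised in the follow-up:

    type ID = String   // pure alias: zero runtime cost, but no new methods

    // Value class: adds methods and usually avoids allocation, except when
    // used as a type argument (e.g. inside an RDD), where boxing returns.
    class Name(val value: String) extends AnyVal {
      def initial: Char = value.head
    }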

Re: Anyone using value classes in RDDs?

2014-04-20 Thread Surendranauth Hiraman
Oh, sorry, I think your point was probably that you wouldn't need runtime allocation. I guess that is the key question. I would be interested if this works for you. -Suren On Sun, Apr 20, 2014 at 9:18 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: If the purpose is only aliasing

Re: standalone vs YARN

2014-04-15 Thread Surendranauth Hiraman
Prashant, In another email thread several weeks ago, it was mentioned that YARN support is considered beta until Spark 1.0. Is that not the case? -Suren On Tue, Apr 15, 2014 at 8:38 AM, Prashant Sharma scrapco...@gmail.com wrote: Hi Ishaaq, answers inline from what I know. I'd like to be

Re: Spark - ready for prime time?

2014-04-11 Thread Surendranauth Hiraman
in http://spark.incubator.apache.org/docs/latest/configuration.html. Matei On Apr 11, 2014, at 7:02 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matei, Where is the functionality in 0.9 to spill data within a task (separately from persist)? My apologies if this is something

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi, Any thoughts on this? Thanks. -Suren On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations

PySpark SocketConnect Issue in Cluster

2014-04-07 Thread Surendranauth Hiraman
Hi, We have a situation where a Pyspark script works fine as a local process (local url) on the Master and the Worker nodes, which would indicate that all python dependencies are set up properly on each machine. But when we try to run the script at the cluster level (using the master's url), if

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
trying to get a sense of how the processing is handled behind the scenes with respect to disk. 2. When else is disk used internally? Any pointers are appreciated. -Suren On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, Any thoughts on this? Thanks

Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/Sequences, like GroupBy and shuffling. Basically, one part of the question is when is

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
of locking/distributed locking is needed on the individual Bloom Filter itself, with performance impact. Agreed? -Suren On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Mayur, Thanks. This step is for creating the Bloom Filter, not using it to filter data
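
One way to sidestep the locking concern above is an associative per-partition merge rather than a shared mutable filter, sketched here with Algebird's BloomFilter monoid (an assumption on my part, not what the thread settled on; sizes and RDD shape are illustrative):

    import com.twitter.algebird.BloomFilter
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    val bfMonoid = BloomFilter(numEntries = 100000, fpProb = 0.01)

    // Build one tiny filter per item and merge within each group_id; because
    // the merge is associative, no locking is needed anywhere.
    def buildFilters(pairs: RDD[(String, String)]) =
      pairs.mapValues(bfMonoid.create(_)).reduceByKey(bfMonoid.plus(_, _))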

Re: Accessing the reduce key

2014-03-20 Thread Surendranauth Hiraman
Grouped by the group_id but not sorted. -Suren On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: You are using the data grouped (sorted?) to create the bloom filter? On Mar 20, 2014 4:35 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Mayur