Re: Tools to manage workflows on Spark

2015-03-01 Thread Himanish Kushary
We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline for orchestration of the different Spark jobs. AWS Data Pipeline provides automatic EMR cluster provisioning, retry on failure, SNS notification, etc. out of the box and works well for us. On Sun, Mar 1, 2015 at 7:02 PM,

Re: Streaming scheduling delay

2015-03-01 Thread Josh J
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas gerard.m...@gmail.com wrote: KafkaOutputServicePool Could you please give example code of what KafkaOutputServicePool would look like? When I tried object pooling I ended up with various NotSerializableExceptions. Thanks! Josh
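A common workaround for the NotSerializableException is to keep the producer in a per-JVM singleton rather than closing over it; a rough sketch against the Kafka 0.8 producer API (the holder object, broker address, and topic are illustrative, not the actual KafkaOutputServicePool):

  object KafkaProducerHolder {
    // created lazily once per executor JVM, so it is never shipped inside a closure
    lazy val producer = {
      val props = new java.util.Properties()
      props.put("metadata.broker.list", "broker1:9092")
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      new kafka.javaapi.producer.Producer[String, String](new kafka.producer.ProducerConfig(props))
    }
  }

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val producer = KafkaProducerHolder.producer
      records.foreach(msg => producer.send(new kafka.producer.KeyedMessage[String, String]("topic", msg)))
    }
  }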

Re: Tools to manage workflows on Spark

2015-03-01 Thread Felix C
We use Oozie as well, and it has worked well. The catch is that each Oozie action is separate, so you cannot carry a SparkContext or RDD (or leverage caching or temp tables) from one action into the next. You could either save output to a file, as sketched below, or put all Spark processing into one Oozie action.
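Saving intermediate results between Oozie actions can be as simple as writing to HDFS at the end of one action and reading it back at the start of the next; a minimal sketch (paths and element types are illustrative):

  // end of action 1: persist what the next action needs
  intermediate.saveAsObjectFile("hdfs:///workflow/intermediate")

  // start of action 2, in a fresh SparkContext:
  val intermediate = sc.objectFile[(String, Long)]("hdfs:///workflow/intermediate")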

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread Vijay Saraswat
GML is a fast, distributed, in-memory sparse (and dense) matrix library. It does not use RDDs for resilience. Instead we have examples that use Resilient X10 (which provides recovery of distributed control structures in case of node failure) and Hazelcast for stable storage. We are looking

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
class DeepCNNFeature extends Transformer ... { override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = { // How can I do a mapPartitions on the underlying RDD and then add the column? } } On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa
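One way to approach this, sketched below under the assumption of Spark 1.3-style DataFrames (loadPretrainedModel and the output column are hypothetical), is to mapPartitions over data.rdd, append a value to each Row, and rebuild the DataFrame with an extended schema:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

  override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
    val newSchema = StructType(data.schema.fields :+
      StructField("cnnFeatures", ArrayType(DoubleType), nullable = false))
    val newRows = data.rdd.mapPartitions { rows =>
      val model = loadPretrainedModel()                         // hypothetical: load the model once per partition
      rows.map(r => Row.fromSeq(r.toSeq :+ model.featurize(r))) // append the computed column to each row
    }
    data.sqlContext.createDataFrame(newRows, newSchema)
  }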

Re: Number of cores per executor on Spark Standalone

2015-03-01 Thread Deborah Siegel
Hi, Someone else will have a better answer. I think that for standalone mode, executors will grab whatever cores they can, based on either configurations on the worker or application-specific configurations. Could be wrong, but I believe Mesos is similar to this, and that YARN is alone in the
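For reference, in standalone mode the per-worker offer and the per-application cap are configured separately; a rough sketch (values are placeholders):

  // spark-env.sh on each worker: SPARK_WORKER_CORES=16 limits what that worker offers
  // in the application: cap the total cores it will grab across the cluster
  val conf = new org.apache.spark.SparkConf()
    .setMaster("spark://master:7077")
    .set("spark.cores.max", "8")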

Re: Scalable JDBCRDD

2015-03-01 Thread Jörn Franke
What database are you using? On Feb 28, 2015 at 18:15, Michal Klos michal.klo...@gmail.com wrote: Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not

Re: Connection pool in workers

2015-03-01 Thread A.K.M. Ashrafuzzaman
Sorry guys, my bad. Here is a high level code sample:

  val unionStreams = ssc.union(kinesisStreams)
  unionStreams.foreachRDD(rdd => {
    rdd.foreach(tweet => {
      val strTweet = new String(tweet, "UTF-8")
      val interaction = InteractionParser.parser(strTweet)
      interactionDAL.insert(interaction)
    })
  })

Re: Scalable JDBCRDD

2015-03-01 Thread Cody Koeninger
I'm a little confused by your comments regarding LIMIT. There's nothing about JdbcRDD that depends on limit. You just need to be able to partition your data in some way such that it has numeric upper and lower bounds. Primary key range scans, not limit, would ordinarily be the best way to do
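For illustration, the JdbcRDD constructor takes numeric lower and upper bounds that are substituted into the two '?' placeholders in the query, which is why a primary-key range scan is the natural partitioning scheme; a sketch (connection details and column names are placeholders):

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val rdd = new JdbcRDD(sc,
    () => DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass"),
    "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
    1L, 10000000L, 100,                          // lower bound, upper bound, number of partitions
    rs => (rs.getLong("id"), rs.getString("payload")))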

Spark Streaming testing strategies

2015-03-01 Thread Marcin Kuthan
I have started using Spark and Spark Streaming and I'm wondering how you test your applications, especially Spark Streaming applications with window-based transformations. After some digging I found the ManualClock class to take full control over stream processing. Unfortunately the class is not

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread shahab
Thanks Vijay, but the setup requirements for GML were not straightforward for me at all, so I put it aside for a while. best, /Shahab On Sun, Mar 1, 2015 at 9:34 AM, Vijay Saraswat vi...@saraswat.org wrote: GML is a fast, distributed, in-memory sparse (and dense) matrix library. It does

RE: Spark SQL Stackoverflow error

2015-03-01 Thread Jishnu Prathap
Hi, the issue was not fixed. I removed the in-between SQL layer and directly created features from the file. Regards Jishnu Prathap From: lovelylavs [via Apache Spark User List] [mailto:ml-node+s1001560n21862...@n3.nabble.com] Sent: Sunday, March 01, 2015 4:44 AM To: Jishnu Menath Prathap (WT01 -

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread shahab
Thanks Joseph for the comments, I think I need to do some benchmarking. best, /Shahab On Sun, Mar 1, 2015 at 1:25 AM, Joseph Bradley jos...@databricks.com wrote: Hi Shahab, There are actually a few distributed Matrix types which support sparse representations: RowMatrix, IndexedRowMatrix,
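For example, a RowMatrix can be backed by sparse rows directly; a small sketch:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // each row stores only its non-zero entries
  val rows = sc.parallelize(Seq(
    Vectors.sparse(1000, Seq((0, 1.0), (37, 2.5))),
    Vectors.sparse(1000, Seq((999, 4.0)))))
  val mat = new RowMatrix(rows)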

Submitting jobs on Spark EC2 cluster: class not found, even if it's on CLASSPATH

2015-03-01 Thread olegshirokikh
Hi there, I'm trying out Spark Job Server (REST) to submit jobs to a Spark cluster. I believe that my problem is unrelated to this specific software, but is otherwise a generic issue with missing jars on the classpath. So every application implements the trait with a SparkJob class: object LongPiJob extends

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
Hi Joseph, Thank you for the tips. I understand what I should do when my data is represented as an RDD. The thing that I can't figure out is how to do the same thing when the data is viewed as a DataFrame and I need to add the result of my pretrained model as a new column in the DataFrame.

Re: Columnar-Oriented RDDs

2015-03-01 Thread Night Wolf
Thanks for the comments guys. Parquet is awesome. My question with using Parquet for on-disk storage: how should I load that into memory as a Spark RDD, cache it, and keep it in a columnar format? I know I can use Spark SQL with Parquet, which is awesome. But as soon as I step out of SQL we
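One way to keep Parquet-backed data columnar in memory beyond plain SQL queries, sketched for the 1.2-era API (path and table name are illustrative): register it as a temp table and call cacheTable, which uses Spark SQL's in-memory columnar store, then keep working with the result programmatically.

  val data = sqlContext.parquetFile("hdfs:///warehouse/events.parquet")
  data.registerTempTable("events")
  sqlContext.cacheTable("events")           // cached in Spark SQL's in-memory columnar format
  val cached = sqlContext.table("events")   // still usable as an RDD of Rows outside SQL
  cached.map(row => row.getString(0)).count()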

Re: Scalable JDBCRDD

2015-03-01 Thread michal.klo...@gmail.com
Jorn: Vertica. Cody: I posited the LIMIT just as an example of how JdbcRDD could be used least invasively. Let's say we used a partition on a time field -- we would still need to have N executions of those queries. The queries we have are very intense and concurrency is an issue even if the

Store Spark data into hive table

2015-03-01 Thread tarek_abouzeid
I am trying to store my word count output into the Hive data warehouse. My pipeline is: Flume streaming => Spark word count => store result in a Hive table for later visualization. My code is: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import
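A rough sketch of the Hive side of such a pipeline, assuming a HiveContext and a (word, count) pair RDD (table and field names are illustrative):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  import hiveContext.createSchemaRDD        // implicit conversion from RDDs of case classes

  case class WordCount(word: String, cnt: Long)
  wordCounts.map { case (w, c) => WordCount(w, c) }.registerTempTable("word_counts_tmp")

  hiveContext.sql("CREATE TABLE IF NOT EXISTS word_counts (word STRING, cnt BIGINT)")
  hiveContext.sql("INSERT INTO TABLE word_counts SELECT word, cnt FROM word_counts_tmp")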

Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
I am trying to compute column similarities on a 30x1000 RowMatrix of DenseVectors. The size of the input RDD is 3.1MB and it's all in one partition. I am running on a single node of 15G and giving the driver 1G and the executor 9G. This is on single-node Hadoop. In the first attempt the

RE: Is SPARK_CLASSPATH really deprecated?

2015-03-01 Thread Taeyun Kim
spark.executor.extraClassPath is especially useful when the output is written to HBase, since the data nodes on the cluster have HBase library jars. -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Friday, February 27, 2015 5:22 PM To: Kannan Rajah Cc: Marcelo
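For reference, it can be set per application without touching SPARK_CLASSPATH; the paths below are illustrative:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.extraClassPath", "/usr/lib/hbase/lib/*")   // HBase jars already on the data nodes
    .set("spark.driver.extraClassPath", "/usr/lib/hbase/lib/*")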

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Sorry, I actually meant a 30 x 10,000 matrix (missed a 0). Regards Sab

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sab, In this dense case, the output will contain 10,000 x 10,000 entries, i.e. 100 million doubles, which doesn't fit in 1GB with overheads. For a dense matrix, similarColumns() scales quadratically in the number of columns, so you need more memory across the cluster. Reza On Sun, Mar 1, 2015

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Hi Reza, I see that ((int, int), double) pairs are generated for any combination that meets the criteria controlled by the threshold. But assuming a simple 1x10K matrix, that means I would need at least 12GB memory per executor for the flat map just for these pairs, excluding any other overhead.

documentation - graphx-programming-guide error?

2015-03-01 Thread Deborah Siegel
Hello, I am running through examples given on http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html The section for Map Reduce Triplets Transition Guide (Legacy) indicates that one can run the following .aggregateMessages code val graph: Graph[Int, Float] = ... def msgFun(triplet:
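For context, aggregateMessages takes a send function over an EdgeContext and a merge function; a minimal sketch along the lines of the guide's example (not necessarily its exact listing):

  import org.apache.spark.graphx.{Edge, EdgeContext, Graph}

  val edges = sc.parallelize(Seq(Edge(1L, 2L, 0.5f), Edge(2L, 3L, 1.5f)))
  val graph: Graph[Int, Float] = Graph.fromEdges(edges, 0)
  def msgFun(triplet: EdgeContext[Int, Float, String]): Unit = triplet.sendToDst("Hi")
  def reduceFun(a: String, b: String): String = a + " " + b
  val result = graph.aggregateMessages[String](msgFun, reduceFun)   // VertexRDD[String]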

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column-based similarities work well if the number of columns is mild (10K, 100K; we actually scaled it to 1.5M columns, but that really stress tests the shuffle and you need to tune the shuffle parameters)... You can either use DIMSUM sampling or come up with your own threshold based on your application that

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sabarish, Works fine for me with less than those settings (30x1000 dense matrix, 1GB driver, 1GB executor): bin/spark-shell --driver-memory 1G --executor-memory 1G Then running the following finished without trouble and in a few seconds. Are you sure your driver is actually getting the RAM
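A sketch of that kind of check, assuming random dense data (not necessarily the exact snippet from the thread):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val cols = 1000
  val rows = sc.parallelize(1 to 30).map(_ => Vectors.dense(Array.fill(cols)(scala.util.Random.nextDouble())))
  val mat = new RowMatrix(rows)
  val sims = mat.columnSimilarities()   // or columnSimilarities(threshold) to enable DIMSUM sampling
  println(sims.entries.count())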

Re: Tools to manage workflows on Spark

2015-03-01 Thread Qiang Cao
Thanks, Himanish and Felix! On Sun, Mar 1, 2015 at 7:50 PM, Himanish Kushary himan...@gmail.com wrote: We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline for orchestration of the different Spark jobs. AWS Data Pipeline provides automatic EMR cluster provisioning, retry

Pushing data from AWS Kinesis -> Spark Streaming -> AWS Redshift

2015-03-01 Thread Mike Trienis
Hi All, I am looking at integrating a data stream from AWS Kinesis to AWS Redshift and since I am already ingesting the data through Spark Streaming, it seems convenient to also push that data to AWS Redshift at the same time. I have taken a look at the AWS kinesis connector although I am not

RE: unsafe memory access in spark 1.2.1

2015-03-01 Thread Zalzberg, Idan (Agoda)
Thanks, We monitor disk space so I doubt that is it, but I will check again From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Sunday, March 01, 2015 11:45 PM To: Zalzberg, Idan (Agoda) Cc: user@spark.apache.org Subject: Re: unsafe memory access in spark 1.2.1 Google led me to:

Re: Connection pool in workers

2015-03-01 Thread Chris Fregly
hey AKM! this is a very common problem. the streaming programming guide addresses this issue here, actually: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#design-patterns-for-using-foreachrdd the tl;dr is this: 1) you want to use foreachPartition() to operate on a whole
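The pattern from that section of the guide boils down to amortizing connection setup over a whole partition; a minimal sketch (ConnectionPool stands in for any lazily initialized, per-JVM pool):

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val connection = ConnectionPool.getConnection()   // hypothetical pool, reused within the executor JVM
      records.foreach(record => connection.send(record))
      ConnectionPool.returnConnection(connection)       // return instead of close, so it can be reused
    }
  }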

unsafe memory access in spark 1.2.1

2015-03-01 Thread Zalzberg, Idan (Agoda)
Hi, I am using spark 1.2.1, sometimes I get these errors sporadically: Any thought on what could be the cause? Thanks 2015-02-27 15:08:47 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[Executor task launch worker-25,5,main] java.lang.InternalError: a fault occurred

RE: unsafe memory access in spark 1.2.1

2015-03-01 Thread Zalzberg, Idan (Agoda)
My run time version is: java version 1.7.0_75 OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) Thanks From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Sunday, March 01, 2015 10:18 PM To: Zalzberg, Idan (Agoda) Cc:

Re: unsafe memory access in spark 1.2.1

2015-03-01 Thread Ted Yu
Google led me to: https://bugs.openjdk.java.net/browse/JDK-8040802 Not sure if the last comment there applies to your deployment. On Sun, Mar 1, 2015 at 8:32 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: My run time version is: java version 1.7.0_75 OpenJDK Runtime

Re: Scalable JDBCRDD

2015-03-01 Thread eric
What you're saying is that, due to the intensity of the query, you need to run a single query and partition the results, versus running one query for each partition. I assume it's not viable to throw the query results into another table in your database and then query that using the normal

Re: Scalable JDBCRDD

2015-03-01 Thread michal.klo...@gmail.com
Yes exactly. The temp table is an approach but then we need to manage the deletion of it etc. I'm sure we won't be the only people with this crazy use case. If there isn't a feasible way to do this within the framework then that's okay. But if there is a way we are happy to write the code and

Re: Columnar-Oriented RDDs

2015-03-01 Thread Koert Kuipers
Hey, I do not have any statistics. I just wanted to show it can be done but left it at that. The memory usage should be predictable: the benefit comes from using arrays for primitive types. Accessing the data row-wise means re-assembling the rows from the columnar data, which I have not tried to

Re: unsafe memory access in spark 1.2.1

2015-03-01 Thread Ted Yu
What Java version are you using ? Thanks On Sun, Mar 1, 2015 at 7:03 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: Hi, I am using spark 1.2.1, sometimes I get these errors sporadically: Any thought on what could be the cause? Thanks 2015-02-27 15:08:47 ERROR

Re: Problem getting program to run on 15TB input

2015-03-01 Thread Arun Luthra
I tried a shorter, simpler version of the program, with just 1 RDD; essentially it is: sc.textFile(..., N).map().filter().map(blah => (id, 1L)).reduceByKey().saveAsTextFile(...) Here is a typical GC log trace from one of the YARN container logs: 54.040: [GC [PSYoungGen:

Re: Spark Streaming testing strategies

2015-03-01 Thread Holden Karau
There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com
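A rough sketch of what a test looks like with that package, assuming its StreamingSuiteBase trait (exact names and signatures may differ between versions):

  import org.apache.spark.streaming.dstream.DStream
  import org.scalatest.FunSuite

  class WordCountSpec extends FunSuite with StreamingSuiteBase {
    test("counts words per batch") {
      val input = List(List("a b a"), List("b"))
      val expected = List(List(("a", 2), ("b", 1)), List(("b", 1)))
      def count(lines: DStream[String]): DStream[(String, Int)] =
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      testOperation(input, count _, expected, ordered = false)
    }
  }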

Re: Pushing data from AWS Kinesis -> Spark Streaming -> AWS Redshift

2015-03-01 Thread Chris Fregly
Hey Mike- Great to see you're using the AWS stack to its fullest! I've already created the Kinesis-Spark Streaming connector with examples, documentation, tests, and everything. You'll need to build Spark from source with the -Pkinesis-asl profile, otherwise the Kinesis bits won't be included in the build.
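Once Spark is built with that profile, creating the receiver looks roughly like this (stream name, endpoint, and intervals are placeholders; check the createStream signature for your Spark version):

  import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kinesis.KinesisUtils

  val ssc = new StreamingContext(sc, Seconds(10))
  // one receiver per shard is the usual guidance; union the resulting streams before processing
  val kinesisStream = KinesisUtils.createStream(
    ssc, "myStream", "https://kinesis.us-east-1.amazonaws.com",
    Seconds(10), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)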