We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline
for orchestration of the different Spark jobs. AWS Data Pipeline provides
automatic EMR cluster provisioning, retry on failure, SNS notification, etc.
out of the box and works well for us.
On Sun, Mar 1, 2015 at 7:02 PM,
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas gerard.m...@gmail.com wrote:
KafkaOutputServicePool
Could you please give example code of what KafkaOutputServicePool would
look like? When I tried object pooling I ended up with various
not-serializable exceptions.
Thanks!
Josh
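One common way to build such a pool is to keep the producer behind a lazily
initialized object, so it is created on each executor JVM instead of being
serialized with the closure. A minimal sketch (my own, not Gerard's class; it
assumes the old Kafka 0.8 Scala producer API and placeholder broker addresses):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Created lazily once per executor JVM; nothing here travels with the task
// closure, which is what normally triggers the NotSerializableException.
object KafkaSink {
  lazy val producer: Producer[String, String] = {
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092,broker2:9092") // placeholder brokers
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    new Producer[String, String](new ProducerConfig(props))
  }

  def send(topic: String, msg: String): Unit =
    producer.send(new KeyedMessage[String, String](topic, msg))
}

// Usage from a stream:
// dstream.foreachRDD(_.foreachPartition(_.foreach(m => KafkaSink.send("events", m))))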
We use Oozie as well, and it has worked well.
The catch is that each Oozie action is separate, so you cannot carry a
SparkContext or RDD, or leverage caching or temp tables, from one Oozie action
into another. You could either save intermediate output to files or put all of
the Spark processing into one Oozie action.
---
GML is a fast, distributed, in-memory sparse (and dense) matrix library.
It does not use RDDs for resilience. Instead we have examples that use
Resilient X10 (which provides recovery of distributed control structures
in case of node failure) and Hazelcast for stable storage.
We are looking
class DeepCNNFeature extends Transformer ... {
  override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
    // How can I do a mapPartitions on the underlying RDD and then add the column?
  }
}
On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa
Hi,
Someone else will have a better answer. I think that for standalone mode,
executors will grab whatever cores they can, based on either configuration on
the worker or application-specific configuration. I could be wrong, but I
believe Mesos is similar to this, and that YARN is alone in the
What database are you using?
On Feb 28, 2015, at 18:15, Michal Klos michal.klo...@gmail.com wrote:
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute the query against
our huge database and not
Sorry guys, my bad,
Here is a high-level code sample:
val unionStreams = ssc.union(kinesisStreams)
unionStreams.foreachRDD { rdd =>
  rdd.foreach { tweet =>
    val strTweet = new String(tweet, "UTF-8")
    val interaction = InteractionParser.parser(strTweet)
    interactionDAL.insert(interaction)
  }
}
I'm a little confused by your comments regarding LIMIT. There's nothing
about JdbcRDD that depends on limit. You just need to be able to partition
your data in some way such that it has numeric upper and lower bounds.
Primary key range scans, not limit, would ordinarily be the best way to do
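To make that concrete, a minimal JdbcRDD sketch (the table, columns, and
connection details are made up): the SQL carries two "?" placeholders, and
Spark fills them with each partition's slice of the numeric key range.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Primary-key range scan split across 20 partitions; each partition binds its
// own lower/upper bound into the two "?" placeholders.
val events = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass"),
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L,          // lowerBound of the numeric key
  10000000L,   // upperBound
  20,          // numPartitions
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("payload")))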
I have started using Spark and Spark Streaming and I'm wondering how you
test your applications, especially Spark Streaming applications with
window-based transformations.
After some digging I found the ManualClock class to take full control over
stream processing. Unfortunately the class is not
Thanks Vijay, but the setup requirement for GML was not straightforward for
me at all, so I put it aside for a while.
best,
/Shahab
On Sun, Mar 1, 2015 at 9:34 AM, Vijay Saraswat vi...@saraswat.org wrote:
GML is a fast, distributed, in-memory sparse (and dense) matrix library.
It does
Hi
The issue was not fixed.
I removed the in-between SQL layer and created the features directly from the file.
Regards
Jishnu Prathap
From: lovelylavs [via Apache Spark User List]
[mailto:ml-node+s1001560n21862...@n3.nabble.com]
Sent: Sunday, March 01, 2015 4:44 AM
To: Jishnu Menath Prathap (WT01 -
Thanks Joseph for the comments, I think I need to do some benchmarking.
best,
/Shahab
On Sun, Mar 1, 2015 at 1:25 AM, Joseph Bradley jos...@databricks.com
wrote:
Hi Shahab,
There are actually a few distributed Matrix types which support sparse
representations: RowMatrix, IndexedRowMatrix,
Hi there,
I'm trying out Spark Job Server (REST) to submit jobs to a Spark cluster. I
believe my problem is unrelated to this specific software and is otherwise a
generic issue of missing jars on the classpath. So every application
implements the SparkJob trait:
object LongPiJob extends
Hi Joseph,
Thank you for the tips. I understand what I should do when my data is
represented as an RDD. The thing I can't figure out is how to do the same
thing when the data is viewed as a DataFrame, and I need to add the result of
my pre-trained model as a new column in the DataFrame.
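One way this is commonly handled (a sketch under the Spark 1.3-style DataFrame
API; the column name and the trivial stand-in for the pre-trained model are
mine): drop to the underlying RDD[Row], run mapPartitions so the model is
initialized once per partition, append the model output to each Row, and
rebuild the DataFrame with an extended schema.

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def addFeatureColumn(sqlContext: SQLContext, data: DataFrame): DataFrame = {
  // Extend the existing schema with one column for the model output.
  val schema = StructType(data.schema.fields :+ StructField("cnnFeature", StringType, nullable = true))
  val rows = data.rdd.mapPartitions { iter =>
    // Expensive, non-serializable initialization goes here, once per partition.
    val model: Row => String = row => row.mkString("_") // stand-in for the real pre-trained model
    iter.map(row => Row.fromSeq(row.toSeq :+ model(row)))
  }
  sqlContext.createDataFrame(rows, schema)
}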
Thanks for the comments guys.
Parquet is awesome. My question with using Parquet for on-disk storage is:
how should I load that into memory as a Spark RDD, cache it, and keep it in a
columnar format?
I know I can use Spark SQL with Parquet, which is awesome. But as soon as I
step out of SQL we
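For what it's worth, the usual answer on the SQL side (a sketch; the path and
table name are placeholders) is to register the Parquet data as a temp table
and call cacheTable, which keeps it in Spark SQL's compressed in-memory
columnar store rather than as generic cached row objects:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val events = sqlContext.parquetFile("hdfs:///data/events.parquet") // placeholder path
events.registerTempTable("events")
sqlContext.cacheTable("events")                                    // in-memory columnar cache
sqlContext.sql("SELECT COUNT(*) FROM events").collect()            // first query materializes the cache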
Jorn: Vertica
Cody: I posited the LIMIT just as an example of how JdbcRDD could be used
least invasively. Let's say we partitioned on a time field -- we would still
need to run N executions of those queries. The queries we have are very
intense, and concurrency is an issue even if the
I am trying to store my word count output into the Hive data warehouse. My
pipeline is:
Flume streaming => Spark word count => store result in Hive table for
visualization later
my code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import
I am trying to compute column similarities on a 30x1000 RowMatrix of
DenseVectors. The size of the input RDD is 3.1MB and it's all in one
partition. I am running on a single node with 15G, giving the driver 1G
and the executor 9G. This is on single-node Hadoop. In the first attempt
the
spark.executor.extraClassPath is especially useful when the output is
written to HBase, since the data nodes on the cluster already have the HBase
library jars.
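As an illustration (the paths here are just a typical layout, not taken from
this thread), the setting can be supplied at submit time; quoting keeps the
shell from expanding the wildcard:

spark-submit \
  --class com.example.MyHBaseJob \
  --conf spark.executor.extraClassPath='/opt/hbase/lib/*' \
  --conf spark.driver.extraClassPath='/opt/hbase/lib/*' \
  my-hbase-job.jar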
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Friday, February 27, 2015 5:22 PM
To: Kannan Rajah
Cc: Marcelo
Sorry, I actually meant a 30 x 10,000 matrix (missed a 0)
Regards
Sab
Hi Sab,
In this dense case, the output will contain 10,000 x 10,000 entries, i.e. 100
million doubles, which doesn't fit in 1GB with overheads.
For a dense matrix, similarColumns() scales quadratically in the number of
columns, so you need more memory across the cluster.
Reza
On Sun, Mar 1, 2015
Hi Reza
I see that ((Int, Int), Double) pairs are generated for any combination that
meets the criteria controlled by the threshold. But assuming a simple
1 x 10K matrix, that means I would need at least 12GB of memory per executor
for the flat map, just for these pairs, excluding any other overhead.
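(Rough arithmetic behind that estimate, my numbers rather than anything stated
in the thread: 10K columns give at most 10K x 10K = 100 million
((Int, Int), Double) pairs; the raw fields are only about 1.6 GB, but boxed,
nested Scala tuples typically cost on the order of 100+ bytes each, which is
how you end up in the 10-12 GB range for the flat map output alone.)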
Hello,
I am running through examples given on
http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html
The section for Map Reduce Triplets Transition Guide (Legacy) indicates
that one can run the following .aggregateMessages code
val graph: Graph[Int, Float] = ...
def msgFun(triplet:
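For reference, the full example in that section of the guide looks roughly
like this (reconstructed from memory of the 1.2.1 docs, so treat it as a
sketch; the graph itself is assumed to be built elsewhere, as in the guide):

import org.apache.spark.graphx._

val graph: Graph[Int, Float] = ... // constructed elsewhere, as in the guide
def msgFun(triplet: EdgeContext[Int, Float, String]) {
  triplet.sendToDst("Hi")
}
def reduceFun(a: String, b: String): String = a + " " + b
val result = graph.aggregateMessages[String](msgFun, reduceFun)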
Column-based similarities work well if the number of columns is moderate (10K,
100K; we actually scaled it to 1.5M columns, but that really stress tests the
shuffle and you need to tune the shuffle parameters)... You can either use
DIMSUM sampling or come up with your own threshold, based on your application,
that
Hi Sabarish,
Works fine for me with less than those settings (30x1000 dense matrix, 1GB
driver, 1GB executor):
bin/spark-shell --driver-memory 1G --executor-memory 1G
Then running the following finished without trouble and in a few seconds.
Are you sure your driver is actually getting the RAM
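Something along these lines (my reconstruction of the kind of check Reza
describes, not his exact code): build a random 30 x 1000 dense RowMatrix in
the shell and compute all-pairs column similarities.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq.fill(30)(Vectors.dense(Array.fill(1000)(math.random))))
val mat = new RowMatrix(rows)
val sims = mat.columnSimilarities()   // columnSimilarities(threshold) enables DIMSUM sampling
println(sims.entries.count())         // roughly 1000 * 999 / 2 entries for dense random data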
Thanks, Himanish and Felix!
On Sun, Mar 1, 2015 at 7:50 PM, Himanish Kushary himan...@gmail.com wrote:
We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline
for orchestration of the different Spark jobs. AWS Data Pipeline provides
automatic EMR cluster provisioning, retry
Hi All,
I am looking at integrating a data stream from AWS Kinesis into AWS Redshift,
and since I am already ingesting the data through Spark Streaming, it seems
convenient to also push that data to AWS Redshift at the same time.
I have taken a look at the AWS Kinesis connector, although I am not
Thanks,
We monitor disk space so I doubt that is it, but I will check again
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Sunday, March 01, 2015 11:45 PM
To: Zalzberg, Idan (Agoda)
Cc: user@spark.apache.org
Subject: Re: unsafe memory access in spark 1.2.1
Google led me to:
hey AKM!
this is a very common problem. the streaming programming guide addresses
this issue here, actually:
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
the tl;dr is this:
1) you want to use foreachPartition() to operate on a whole
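Applied to the Kinesis snippet earlier in this digest, the pattern looks
roughly like this (a sketch; InteractionParser and the DAL come from that
code, the placeholder constructor and the per-partition setup are the point
being illustrated):

unionStreams.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Create (or borrow from a pool) the non-serializable client once per
    // partition, on the executor, instead of once per record or on the driver.
    val dal = new InteractionDAL() // placeholder constructor for the DAL used in the thread
    records.foreach { bytes =>
      val interaction = InteractionParser.parser(new String(bytes, "UTF-8"))
      dal.insert(interaction)
    }
    dal.close() // release the connection when the partition is done
  }
}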
Hi,
I am using spark 1.2.1, sometimes I get these errors sporadically:
Any thought on what could be the cause?
Thanks
2015-02-27 15:08:47 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception
in thread Thread[Executor task launch worker-25,5,main]
java.lang.InternalError: a fault occurred
My run time version is:
java version 1.7.0_75
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
Thanks
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Sunday, March 01, 2015 10:18 PM
To: Zalzberg, Idan (Agoda)
Cc:
Google led me to:
https://bugs.openjdk.java.net/browse/JDK-8040802
Not sure if the last comment there applies to your deployment.
On Sun, Mar 1, 2015 at 8:32 AM, Zalzberg, Idan (Agoda)
idan.zalzb...@agoda.com wrote:
My run time version is:
java version 1.7.0_75
OpenJDK Runtime
What you're saying is that, due to the intensity of the query, you need
to run a single query and partition the results, versus running one
query for each partition.
I assume it's not viable to throw the query results into another table
in your database and then query that using the normal
Yes exactly.
The temp table is an approach, but then we need to manage its deletion, etc.
I'm sure we won't be the only people with this crazy use case.
If there isn't a feasible way to do this within the framework, then that's
okay. But if there is a way, we are happy to write the code and
Hey,
I do not have any statistics. I just wanted to show it can be done, but left
it at that. The memory usage should be predictable: the benefit comes from
using arrays for primitive types. Accessing the data row-wise means
re-assembling the rows from the columnar data, which I have not tried to
What Java version are you using ?
Thanks
On Sun, Mar 1, 2015 at 7:03 AM, Zalzberg, Idan (Agoda)
idan.zalzb...@agoda.com wrote:
Hi,
I am using spark 1.2.1, sometimes I get these errors sporadically:
Any thought on what could be the cause?
Thanks
2015-02-27 15:08:47 ERROR
I tried a shorter, simpler version of the program, with just 1 RDD;
essentially it is:
sc.textFile(..., N).map().filter().map(blah => (id, 1L)).reduceByKey().saveAsTextFile(...)
Here is a typical GC log trace from one of the YARN container logs:
54.040: [GC [PSYoungGen:
There is also the Spark Testing Base package which is on spark-packages.org
and hides the ugly bits (it's based on the existing streaming test code but
I cleaned it up a bit to try and limit the number of internals it was
touching).
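For completeness, a test with that package tends to look something like the
following (sketched from memory of spark-testing-base, so the trait and method
names are assumptions to verify against the package's README):

import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite with StreamingSuiteBase {
  test("counts words across batches") {
    val input = List(List("a b"), List("b"))
    val expected = List(List(("a", 1), ("b", 1)), List(("b", 1)))
    def count(lines: DStream[String]): DStream[(String, Int)] =
      lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // testOperation feeds the input batches through the operation on a manual
    // clock and compares the results against the expected batches.
    testOperation(input, count _, expected)
  }
}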
On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com
Hey Mike-
Great to see you're using the AWS stack to its fullest!
I've already created the Kinesis-Spark Streaming connector with examples,
documentation, tests, and everything. You'll need to build Spark from
source with the -Pkinesis-asl profile, otherwise they won't be included in
the build.
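For reference, the kind of build invocation that implies is along these lines
(assembled from the docs of that era, so check it against your Spark version):

mvn -Pkinesis-asl -DskipTests clean package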