Hi,
I am using Spark 1.0.0 on a 3-node cluster with 1 master and 2 slaves. I
am trying to run the LR algorithm over Spark Streaming.
package org.apache.spark.examples.streaming;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import
You can look at creating a DStream directly from the S3 location using
fileStream. If you want to apply any specific logic, you can rely on
queueStream: read the data yourself from S3, process it, and push it into
the RDD queue.
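For illustration, a minimal Scala sketch of the queueStream approach, assuming
the Spark Streaming 1.0 StreamingContext.queueStream API; the bucket path,
batch interval, and processing step are made up:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("QueueStreamSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Each RDD pushed onto this queue is consumed as one batch of the DStream.
val rddQueue = new mutable.Queue[RDD[String]]()
val stream = ssc.queueStream(rddQueue)
stream.print()
ssc.start()

// Read the data from S3 yourself, apply whatever logic you need, then enqueue it.
val fromS3: RDD[String] = ssc.sparkContext.textFile("s3n://my-bucket/my-prefix/")
rddQueue += fromS3.map(_.trim)  // illustrative processing step

ssc.awaitTermination()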
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Thanks Xiangrui,
I switched to an Ubuntu 14.04 server and it works after installing
liblapack3gf and libopenblas-base.
So it is an environment problem which is not related to MLlib.
Thank you, Andrew!
On 5 June 2014 23:14, Andrew Ash and...@andrewash.com wrote:
Oh, my apologies, that was for 1.0.
For Spark 0.9 I did it like this:
MASTER=spark://mymaster:7077 SPARK_MEM=8g ./bin/spark-shell -c
$CORES_ACROSS_CLUSTER
The downside of this though is that SPARK_MEM also sets
Thank you, Hassan!
On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote:
just use -Dspark.executor.memory=
Where are you getting the serialization error? It's likely to be a different
problem. Which class is not getting serialized?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jun 5, 2014 at 6:32 PM, Vibhor Banga
You can comment out this function and create a new one which will return
your AMI id, and the rest of the script will run fine.
def get_spark_ami(opts):
    instance_types = {
        "m1.small": "pvm",
        "m1.medium": "pvm",
        "m1.large": "pvm",
        "m1.xlarge": "pvm",
        "t1.micro": "pvm",
        "c1.medium":
Yes, initialization on each turn is hard. You seem to be using Python. Another
risky thing you can try is to serialize the MongoClient object using any
serializer (like Kryo wrappers in Python), pass it on to the mappers, and then
in each mapper you just have to deserialize it and use it directly... This
may
I closed that PR for other reasons. This change is still proposed by itself:
https://issues.apache.org/jira/browse/SPARK-2034
https://github.com/apache/spark/pull/980
On Fri, Jun 6, 2014 at 1:35 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Sean,
your patch fixes the issue, thank you so much!
I have the same problem (upgrading from Spark 0.9.1 to 1.0.0 throws the error)
and I do call saveAsTextFile. Recompiled for 1.0.0.
org.apache.spark.SparkException: Job aborted due to stage failure: Task
0.0:10 failed 4 times, most recent failure: Exception failure in TID 1616 on
host
I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
returning errors whether saving to the cloud or saving locally. When I start a
job on the cluster I do get an error, but after this error it keeps on
running fine.
Hi TD,
I have the same question: I need the workers to process in arrival order,
since it's updating a state based on the previous one.
Thanks in advance.
Rod
Thanks Matei. We have tested the fix and it's working perfectly.
Andrew, we set spark.shuffle.spill=false but the application goes out of
memory. I think that is expected.
Regards, Ajay
On Friday, June 6, 2014 3:49 AM, Andrew Ash and...@andrewash.com wrote:
Hi Ajay,
Can you please try
Hi,
I am trying to build Shark-0.9.1 from source, with Spark-1.0.0 as its
dependency, using the sbt package command. But I am getting the error below
during the build, which makes me think that perhaps Spark-1.0.0 is not
compatible with Shark-0.9.1:
[info] Compilation completed in 9.046 s
Thanks for the response, Akhil. My email may not have been clear, but my
question is about what should be inside the AMI image, not how to pass an
AMI id to the spark_ec2 script.
Should certain packages be installed? Do certain directories need to exist?
etc...
On Fri, Jun 6, 2014 at 4:40
Here is the PR https://github.com/apache/spark/pull/997
2014-06-03 19:24 GMT+02:00 Patrick Wendell pwend...@gmail.com:
You can set an arbitrary properties file by adding the --properties-file
argument to spark-submit. It would be nice to have spark-submit also
look in SPARK_CONF_DIR by
Hi Matt,
You will need the following on the AMI:
1. Java installed
2. Root login enabled
3. /mnt should be available (since all the storage goes there)
The rest the spark-ec2 script will set up for you. Let me know if you
need any more clarification on this.
Thanks
Best Regards
To add another note on the benefits of using Scala to build Spark, here is
a very interesting and well-written post
http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
on
the Databricks blog about how Scala 2.10's runtime reflection enables
Hi All,
I am passing Java static methods into the RDD transformations map and
mapValues. The first map is from a simple string K into a (K, V) pair, where V
is a Java ArrayList of large text strings, 50K each, read from Cassandra.
mapValues then processes these text blocks into very small ArrayLists.
Hi,
I want to create a (very large) Bayes net using GraphX. To do so, I need to
be able to associate conditional probability tables with each node of the
graph. Is there any way to do this? All of the examples I've seen just have
the basic nodes and vertices, with no associated information.
thanks, Greg
Where are the APIs for QueueStream and RDDQueue?
In my solution I cannot open a DStream on the S3 location, because I need
to run a script on the file (a script that unluckily doesn't accept
stdin as input), so I have to download it to my disk somehow and then handle
it from there before creating the
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell.
On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov
oleg.proudni...@gmail.com wrote:
Thank you, Hassan!
On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote:
just use -Dspark.executor.memory=
We experienced a similar issue in our environment; below is the whole stack
trace. It works fine if we run in local mode, but if we run it in cluster mode
(even with the master and 1 worker on the same node), we have this
serialVersionUID issue. We use Spark 1.0.0, compiled with JDK 6.
Here is a link about
Thank you, Patrick.
I am planning to switch to 1.0 now.
By way of feedback - I used Andrew's suggestion and found that it does
exactly that - it sets the Executor JVM heap and nothing else. The workers
have already been started, and when the shell starts, it is now able to
control the Executor JVM heap.
Thanks Akhil! I'll give that a try!
Additional observation - the map and mapValues are pipelined and executed,
as expected, in pairs. This means that there is a simple sequence of steps:
first the read from Cassandra and then the processing, for each value of K.
This is the exact behaviour of a normal Java loop with these two steps.
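For illustration, a minimal Scala sketch of that pattern; the read and
summarize helpers below are made-up stand-ins for the Cassandra read and the
text processing:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical stand-ins for the real Cassandra read and the block processing.
def readBlocksFromCassandra(key: String): Seq[String] = Seq("large-text-block-for-" + key)
def summarize(blocks: Seq[String]): Seq[Int] = blocks.map(_.length)

val sc = new SparkContext(new SparkConf().setAppName("PipelineSketch").setMaster("local[*]"))

val keys = sc.parallelize(Seq("k1", "k2"))
val blocks = keys.map(k => (k, readBlocksFromCassandra(k)))  // K -> (K, large blocks)
val summaries = blocks.mapValues(summarize)                  // (K, large blocks) -> (K, small summaries)

// map and mapValues are narrow transformations, so Spark pipelines them in one stage:
// each key is read and then immediately summarized, like a single pass of a plain loop.
summaries.collect().foreach(println)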
There is not an official updated version of Shark for Spark 1.0 (though you
might check out the untested spark-1.0 branch on GitHub).
You can also check out the preview release of Shark that runs on Spark SQL:
https://github.com/amplab/shark/tree/sparkSql
Michael
On Fri, Jun 6, 2014 at
Is there a repo somewhere with the code for the Hive dependencies (hive-exec,
hive-serde, hive-metastore) used in Spark SQL? Are they forked with
Spark-specific customizations, like Shark, or simply relabeled with a new
package name (org.spark-project.hive)? I couldn't find any repos on GitHub
Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat
support. We recently wrote a custom Hadoop InputFormat so we can support
flat binary files
(https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop),
and have been testing it in Scala.
Vertices can have arbitrary properties associated with them:
http://spark.apache.org/docs/latest/graphx-programming-guide.html#the-property-graph
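As a minimal sketch of attaching a conditional probability table to each
vertex (the CPT type and the toy two-node net below are made up for
illustration, using GraphX's Graph(vertices, edges) constructor):

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Toy CPT: probability of the node being true, keyed by the parents' boolean assignment.
case class CPT(parents: Seq[String], table: Map[Seq[Boolean], Double])

def buildBayesNet(sc: SparkContext): Graph[CPT, String] = {
  val vertices: RDD[(VertexId, CPT)] = sc.parallelize(Seq(
    (1L, CPT(Nil, Map(Seq.empty[Boolean] -> 0.2))),                   // root node, e.g. "rain"
    (2L, CPT(Seq("rain"), Map(Seq(true) -> 0.9, Seq(false) -> 0.1)))  // child node, e.g. "wet grass"
  ))
  val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "parent-of")))
  Graph(vertices, edges)  // each vertex carries its CPT as its property
}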
Ankur http://www.ankurdave.com/
They are forked and slightly modified for two reasons:
(a) Hive embeds a bunch of other dependencies in their published jars,
which makes it really hard for other projects to depend on
them. If you look at the hive-exec jar they copy a bunch of other
dependencies directly into this jar. We
Just ran into this today myself. I'm on branch-1.0 using a CDH3
cluster (no modifications to Spark or its dependencies). The error
appeared trying to run GraphX's .connectedComponents() on a ~200GB
edge list (GraphX worked beautifully on smaller data).
Here's the stacktrace (it's quite similar to
Is anyone experiencing problems with window operations?
dstream1.print()
val dstream2 = dstream1.groupByKeyAndWindow(Seconds(60))
dstream2.print()
In my application the first print() prints out all the strings and
their keys, but after the window function everything is lost and
nothing gets printed.
Andrew,
Thank you. I'm using mapPartitions(), but as you say, it requires that
every partition fit in memory. This will work for now but may not always
work, so I was wondering about another way.
Thanks,
Roger
On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash and...@andrewash.com wrote:
Hi Roger,
Someone correct me if this is wrong, but I believe 2 very important things
to know about your cluster are:
1. How many cores your cluster has available.
2. How much memory your cluster has available. (Perhaps this could
be divided into total/in-use/free or something.)
Is
Hi,
I am trying to write and debug Spark applications in Scala IDE with
Maven, and in my code I target a Spark instance at spark://xxx
object App {
  def main(args: Array[String]) {
    println("Hello World!")
    val sparkConf = new
Great, thanks for the info and pointer to the repo!
From: Patrick Wendell pwend...@gmail.com
Sent: Friday, June 6, 2014 5:11 PM
To: user@spark.apache.org
They are forked and slightly modified for two reasons:
(a) Hive embeds a bunch of other
Minor point, but does anyone else find the new (and super helpful!) kill
link awfully close to the stage detail link in the 1.0.0 Web UI?
I think it would be better to have the kill link flush right, leaving a
large amount of space between it and the stage detail link.
Nick
Nick Chammas wrote:
I think it would be better to have the kill link flush right, leaving a
large amount of space between it and the stage detail link.
I think even better would be to have a pop-up confirmation: "Do you really
want to kill this stage?" :)
Hi!
I have two questions:
1.
I want to test my application. My app will write the result to Elasticsearch
(stage 1) with saveAsHadoopFile. How can I write a test case for it? Do I need
to pull up a MiniDFSCluster? Or is there another way?
My application flow plan:
Kafka => Spark Streaming (enrich) -
Hi Tathagata,
I'm seeing the same issue on a load run overnight with Kafka streaming (6000
msgs/sec) and a 500 ms batch size. Again, it is occasional and only happening
after a few hours, I believe.
I'm using updateStateByKey.
Any comment will be very welcome.
Thanks,
Rod
This might be a stupid question... but it seems that saveAsParquetFile()
writes everything back to HDFS. I am wondering if it is possible to cache
parquet-format intermediate results in memory, thereby making Spark SQL
queries faster.
Thanks.
-Simon
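A minimal sketch of keeping a Parquet-backed table in memory with Spark SQL,
assuming the Spark 1.0 API (SQLContext.parquetFile, SchemaRDD.registerAsTable,
SQLContext.cacheTable); the path and table name are made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetCacheSketch"))
val sqlContext = new SQLContext(sc)

// Load an existing Parquet file and expose it as a table (path/name are illustrative).
val people = sqlContext.parquetFile("hdfs:///tmp/people.parquet")
people.registerAsTable("people")

// Cache the table in Spark SQL's in-memory columnar store so that
// subsequent queries read from memory instead of going back to HDFS.
sqlContext.cacheTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)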
Yup, when it's running, DStream.print() will print out a timestamped block
for every time step, even if the block is empty. (for v1.0.0, which I have
running in the other window)
If you're not getting that, I'd guess the stream hasn't started up
properly.
On Sat, Jun 7, 2014 at 11:50 AM,
It's going well enough that this is a "how should I in 1.0.0" rather than a
"how do I" question.
So I've got data coming in via Streaming (tweets) and I want to
archive/log it all. It seems a bit wasteful to generate a new HDFS file for
each DStream batch, but also I want to guard against data loss from
And then an "are you sure?" after that :)
On 7 Jun 2014 06:59, Mikhail Strebkov streb...@gmail.com wrote:
Nick Chammas wrote
I think it would be better to have the kill link flush right, leaving a
large amount of space between it and the stage detail link.
I think even better would be to have a