Yes, sorry I meant DAG. I fixed it in my message but not the subject. The
terminology of 'leaf' wasn't helpful, I know, so hopefully my visual example
was enough. Anyway, I noticed what you said in a local-mode test. I can try
that in a cluster, too. Thank you!
On Thu, Sep 18, 2014 at 10:28 PM,
Sweet, that's probably it. Too bad it didn't seem to make 1.1?
On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust
mich...@databricks.com wrote:
The unknown slowdown might be addressed by
https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
On Sun, Sep 14, 2014 at
Hi Abel,
Pretty interesting. May I ask how big your point CSV dataset is?
It seems you are relying on a linear search through the FeatureCollection of
polygons to find the one that intersects your point. This is going to be
extremely slow. I highly recommend using a SpatialIndex, such as the
many that
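For illustration, a rough sketch of the spatial-index approach using JTS's STRtree (the polygon collection, coordinates, and helper names are placeholders, not taken from the original code):

import com.vividsolutions.jts.geom.{Coordinate, Geometry, GeometryFactory}
import com.vividsolutions.jts.index.strtree.STRtree
import scala.collection.JavaConverters._

// Build the index once over all polygons (placeholder Seq[Geometry]).
def buildIndex(polygons: Seq[Geometry]): STRtree = {
  val index = new STRtree()
  polygons.foreach(p => index.insert(p.getEnvelopeInternal, p))
  index
}

// The envelope query returns candidates; an exact intersects() test filters them.
def polygonsContaining(index: STRtree, x: Double, y: Double): Seq[Geometry] = {
  val point = new GeometryFactory().createPoint(new Coordinate(x, y))
  index.query(point.getEnvelopeInternal)
    .asInstanceOf[java.util.List[Geometry]]
    .asScala
    .filter(_.intersects(point))
}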
Thanks for the info, Frank.
Twitter's chill-avro serializer looks great.
But how does Spark identify it as a serializer, since it doesn't extend
KryoSerializer?
(Sorry, Scala is an alien language to me.)
-
Thanks Regards,
Mohan
Hello everyone,
What should be the normal time difference between Scala and Python using
Spark? I mean running the same program in the same cluster environment.
In my case I am using numpy array structures for the Python code and
vectors for the Scala code, both for handling my data. The time
Hello!
Could you please add us to your Powered By page?
Project name: Ubix.io
Link: http://ubix.io
Components: Spark, Shark, Spark SQL, MLlib, GraphX, Spark Streaming, Adam
project
Description: blank for now
Hey, I don't think that's the issue; foreach is called on 'results', which is
a DStream of floats, so naturally it passes RDDs to its function.
And either way, changing the code in the first mapper to comment out the
map/reduce process on the RDD
Float f = 1.0f; // nnRdd.map(new Function<NeuralNet,
Hello everybody,
I'm new to Spark Streaming and have played around a bit with WordCount and a
PageRank algorithm in a cluster environment.
Am I right that in the cluster each executor processes its part of the data
stream separately, and that the result of each executor is independent of the
other executors?
In the
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages. The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
Derp, one caveat to my solution: I guess Spark doesn't use Kryo for
Function serde :(
On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais pw...@yelp.com wrote:
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just
Hi,
I'd made some modifications to the Spark source code on the master and
propagated them to the slaves using rsync.
I used this command:
rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory
This worked perfectly. But I wanted to simultaneously
Hi,
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
This worked perfectly. But I wanted to simultaneously rsync to all the
slaves. So I added the other slaves as follows:
rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname
Hi Tobias,
I've copied the files from master to all the slaves.
On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
This worked perfectly. But, I wanted to simultaneously rsync all
* you have copied a lot of files from various hosts to username@slave3:path*
only from one node to all the other nodes...
On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi Tobias,
I've copied the files from master to all the slaves.
On Fri, Sep 19, 2014
Here's what we've tried so far as a first example of a custom Mongo receiver:
class MongoStreamReceiver(host: String)
  extends NetworkReceiver[String] {
  protected lazy val blocksGenerator: BlockGenerator =
    new BlockGenerator(StorageLevel.MEMORY_AND_DISK_SER_2)
  protected def onStart()
-- Forwarded message --
From: rapelly kartheek kartheek.m...@gmail.com
Date: Fri, Sep 19, 2014 at 1:51 PM
Subject: Re: rsync problem
To: Tobias Pfeiffer t...@preferred.jp
Any idea why the cluster is dying down?
On Fri, Sep 19, 2014 at 1:47 PM, rapelly kartheek
No, it is actually a quite different 'alpha' project under the same name:
linear algebra DSL on top of H2O and also Spark. It is not really about
algorithm implementations now.
On Sep 19, 2014 1:25 AM, Matthew Farrellee m...@redhat.com wrote:
On 09/18/2014 05:40 PM, Sean Owen wrote:
No, the
The product of each mapPartitions call can be an Iterable of one big Map.
You still need to write some extra custom code like what lookup() does to
exploit this data structure.
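A minimal sketch of that pattern, assuming an existing SparkContext sc (the key type, partition count, and the full-scan lookup are only for illustration; a real lookup() analogue would query just the partition the key hashes to):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Partition the pairs so each key lives in exactly one partition.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
  .partitionBy(new HashPartitioner(4))

// Collapse each partition into a single big Map, kept as the partition's only element.
val mapsPerPartition: RDD[Map[String, Int]] =
  pairs.mapPartitions(iter => Iterator(iter.toMap), preservesPartitioning = true)

// Full-scan lookup over the per-partition maps.
def lookupKey(key: String): Option[Int] =
  mapsPerPartition.flatMap(_.get(key)).collect().headOption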
On Sep 18, 2014 11:07 PM, Harsha HN 99harsha.h@gmail.com wrote:
Hi All,
My question is related to improving
Hi,
Is there a way to bulk-load into HBase from an RDD?
HBase offers the HFileOutputFormat class for bulk loading from a MapReduce job, but
I cannot figure out how to use it with saveAsHadoopDataset.
Thanks.
Hi,
After reading several documents, it seems that saveAsHadoopDataset cannot
use HFileOutputFormat.
It's because the saveAsHadoopDataset method uses JobConf, so it belongs to the
old Hadoop API, while HFileOutputFormat is a member of the mapreduce package,
which is for the new Hadoop API.
Am I
Hi,
Sorry, I just found saveAsNewAPIHadoopDataset.
Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there
any example code for that?
Thanks.
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, September 19, 2014 8:18 PM
To:
On 09/19/2014 05:06 AM, Sean Owen wrote:
No, it is actually a quite different 'alpha' project under the same
name: linear algebra DSL on top of H2O and also Spark. It is not really
about algorithm implementations now.
On Sep 19, 2014 1:25 AM, Matthew Farrellee m...@redhat.com
Thank you for the example code.
Currently I use foreachPartition() + Put(), but your example code can be used
to clean up my code.
BTW, since the data uploaded by Put() goes through the normal HBase write path, it
can be slow.
So, it would be nice if bulk-load could be used, since it
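For reference, a rough sketch of that foreachPartition() + Put() approach (HBase 0.98-era client API; the table name, column family, and value types are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def writeToHBase(rdd: RDD[(String, String)]): Unit = {
  rdd.foreachPartition { rows =>
    // One connection per partition, not per record.
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "my_table")   // placeholder table name
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.flushCommits()
    table.close()
  }
}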
In fact, it seems that Put can be used by HFileOutputFormat, so the Put object
itself may not be the problem.
The problem is that TableOutputFormat uses the Put object in the normal way
(which goes through the normal write path), while HFileOutputFormat uses it to
directly build the HFile.
From:
Apologies for the delay in getting back on this. It seems the Kinesis example
does not run on Spark 1.1.0 even when it is built using the kinesis-asl profile,
because of a dependency conflict in the HTTP client (same issue as
Agreed that the bulk import would be faster. In my case, I wasn't expecting
a lot of data to be uploaded to HBase, and I also didn't want to take the
pain of importing the generated HFiles into HBase. Is there a way to invoke the
HBase HFile import batch script programmatically?
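One option (a hedged sketch, not tested here; the HFile directory and table name are placeholders) is to call HBase's LoadIncrementalHFiles directly instead of the shell script:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table")           // placeholder table name
val loader = new LoadIncrementalHFiles(conf)
loader.doBulkLoad(new Path("/tmp/hfiles"), table)  // directory containing the generated HFiles
table.close()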
On 19 September 2014
Hi Mohan,
It’s a bit convoluted to follow in their source, but they essentially typedef
KSerializer as being a KryoSerializer, and then their serializers all extend
KSerializer. Spark should identify them properly as Kryo Serializers, but I
haven’t tried it myself.
Regards,
Frank Austin
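For what it's worth, this is roughly how a custom Kryo serializer gets wired into Spark's Kryo support; the chill-avro factory call is an unverified assumption, so it is left commented out, and MyRecord is a placeholder Avro-generated class:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Assumed chill-avro API (unverified), for a placeholder Avro class MyRecord:
    // kryo.register(classOf[MyRecord], AvroSerializer.SpecificRecordBinarySerializer[MyRecord])
  }
}

// The registrator class must be on the executor classpath.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")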
Jatin,
If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Jatin,
HashingTF should be able to solve the memory problem if you use a
small feature dimension in
I'm running out of options trying to integrate Cassandra, Spark, and the
spark-cassandra-connector.
I quickly found out that just grabbing the latest versions of everything
(drivers, etc.) doesn't work; binary incompatibilities, it would seem.
So last I tried using versions of drivers from the
Thanks, Shivaram.
Kui
On Sep 19, 2014, at 12:58 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
As R is single-threaded, SparkR launches one R process per executor on
the worker side.
Thanks
Shivaram
On Thu, Sep 18, 2014 at 7:49 AM, oppokui oppo...@gmail.com wrote:
onStart should be non-blocking. You may try to create a thread in onStart
instead.
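Roughly like this, using the Spark 1.1 Receiver API (the Mongo polling itself is left as a stub, and the sleep interval is arbitrary):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MongoStreamReceiver(host: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // Do the blocking work on a separate thread so onStart returns immediately.
    new Thread("MongoStreamReceiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* signal the receive loop to stop if needed */ }

  private def receive(): Unit = {
    while (!isStopped()) {
      // Stub: fetch documents from MongoDB at `host` and push them to Spark with store(...).
      Thread.sleep(1000)  // placeholder pause between polls
    }
  }
}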
- Original Message -
From: t1ny wbr...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Friday, September 19, 2014 1:26:42 AM
Subject: Re: Spark Streaming and ReactiveMongo
Here's what we've tried so
Thank you Soumya Simanta and Tobias. I've deleted the contents of the work
folder on all the nodes.
Now it's working perfectly, as it was before.
Thank you
Karthik
On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
One possible reason might be that the checkpointing
It turns out that it was the Hadoop version that was the issue.
spark-1.0.2-hadoop1 and spark-1.1.0-hadoop1 both work.
spark-1.0.2-hadoop2, spark-1.1.0-hadoop2.4 and spark-1.1.0-hadoop2.4 do not
work.
It's strange because for this little test I am not even using HDFS at all.
-- Eric
On Thu,
Excellent - that's exactly what I needed. I saw iterator() but missed the
toLocalIterator() method.
Hi,
I am working with the SVMWithSGD classification algorithm in Spark. It
works fine for me; however, I would like to distinguish the instances that
are classified with high confidence from those with low confidence. How do we
define the threshold here? Ultimately, I want to keep only those for which
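One possibility (a sketch only; the cutoff value is a placeholder you would tune) is to clear the model's threshold so that predict() returns the raw margin, and then filter on that margin yourself:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def scoreWithMargin(training: RDD[LabeledPoint], test: RDD[LabeledPoint]) = {
  val model = SVMWithSGD.train(training, 100)   // 100 iterations, as an example
  model.clearThreshold()                        // predict() now returns the raw margin
  val scored = test.map(p => (model.predict(p.features), p.label))
  // Keep only confident predictions: margin magnitude above a chosen cutoff.
  scored.filter { case (margin, _) => math.abs(margin) > 1.0 }   // 1.0 is a placeholder
}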
Hi, Spark experts,
I have the following issue when using the AWS Java SDK in my Spark application.
I narrowed it down to the following steps to reproduce the problem:
1) I have Spark 1.1.0 with Hadoop 2.4 installed on a 3-node cluster.
2) From the master node, I did the following steps:
spark-shell
Hi Chinchu,
SparkEnv is an internal class that is only meant to be used within Spark.
Outside of Spark, it will be null because there are no executors or driver
to start an environment for. Similarly, SparkFiles is meant to be used
internally (though its privacy settings should be modified to
I think it's normal.
On Fri, Sep 19, 2014 at 12:07 AM, Luis Guerra luispelay...@gmail.com wrote:
Hello everyone,
What should be the normal time difference between Scala and Python using
Spark? I mean running the same program in the same cluster environment.
In my case I am using numpy array
Hi
I wrote a little Java job to try to figure out how RDD pipe works.
Below is my test shell script. If I turn on debugging in the script, I get
output in my console. If debugging is turned off in the shell script, I do
not see anything in my console. Is this a bug or a feature?
I am running
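For anyone following along, a minimal pipe() example of the kind being described (the script path is a placeholder; assuming an existing SparkContext sc):

val data = sc.parallelize(Seq("1", "2", "3"))
// Each element is written to the script's stdin; the script's stdout lines become the new RDD.
val piped = data.pipe("/path/to/test.sh")
piped.collect().foreach(println)
// Note: the script runs on the executors, so anything it writes to stderr
// typically ends up in the executor logs rather than the driver console.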
Hi,
I have a program similar to the BinaryClassifier example that I am running
using my data (which is fairly small). I run this for 100 iterations. I
observed the following performance:
Standalone mode cluster with 10 nodes (with Spark 1.0.2): 5 minutes
Standalone mode cluster with 10 nodes
What is in 'rdd' here, to double-check? Do you mean the Spark shell when
you say console? At the end you're grepping some redirected
output, but where is that from?
On Sep 19, 2014 7:21 PM, Andy Davidson a...@santacruzintegration.com
wrote:
Hi
I am wrote a little java job to try and
Hey, just a minor clarification: you _can_ use SparkFiles.get in your
application only if it runs on the executors, e.g. in the following way:
sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect()
But not in general (otherwise NPE, as in your case). Perhaps this should be
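In other words, the pattern is roughly this (the file name and local path are placeholders; assuming an existing SparkContext sc):

import org.apache.spark.SparkFiles

sc.addFile("/local/path/to/my.file")   // ship the file to the executors

// Works: SparkFiles.get is evaluated inside the task, on the executors.
val paths = sc.parallelize(1 to 100).map(_ => SparkFiles.get("my.file")).collect()

// Calling SparkFiles.get in driver code that runs outside a task can yield the NPE described above.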
Hi,
I am using the latest release Spark 1.1.0. I am trying to build the
streaming examples (under examples/streaming) as a standalone project with
the following streaming.sbt file. When I run sbt assembly, I get an error
stating that object algebird is not a member of package com.twitter. I
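If the missing dependency is the cause, the fix would be adding algebird to streaming.sbt; the version below is only a placeholder and should be matched to the one the Spark examples build uses:

libraryDependencies += "com.twitter" %% "algebird-core" % "0.1.11"  // placeholder version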
Hi Evan,
here is an improved version, thanks for your advice. But you know, the last step,
saveAsTextFile, is very slow :(
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.net.URL
import java.text.SimpleDateFormat
import
I successfully did this once.
Map the RDD to an RDD[(ImmutableBytesWritable, KeyValue)], then:
val conf = HBaseConfiguration.create()
val job = new Job(conf, "CEF2HFile")
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val table = new
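Filling in the rest as a hedged sketch (the table name and output path are placeholders; note the RDD must already be sorted by row key, since HFileOutputFormat expects ordered cells):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def writeHFiles(rdd: RDD[(ImmutableBytesWritable, KeyValue)]): Unit = {
  val conf = HBaseConfiguration.create()
  val job = new Job(conf, "CEF2HFile")
  job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setMapOutputValueClass(classOf[KeyValue])
  val table = new HTable(conf, "my_table")                // placeholder table name
  // Copies table-specific settings (compression, bloom filters, etc.) into the job config.
  HFileOutputFormat.configureIncrementalLoad(job, table)

  rdd.saveAsNewAPIHadoopFile(
    "/tmp/hfiles",                                        // placeholder output directory
    classOf[ImmutableBytesWritable],
    classOf[KeyValue],
    classOf[HFileOutputFormat],
    job.getConfiguration)
}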
Have you set spark.local.dir (I think that is the config setting)?
It needs to point to a volume with plenty of space.
By default, if I recall correctly, it points to /tmp.
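e.g. something like this before the SparkContext is created (the directory is a placeholder; it can also go into conf/spark-defaults.conf or the SPARK_LOCAL_DIRS environment variable):

val conf = new org.apache.spark.SparkConf()
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // placeholder directory with plenty of space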
Sent from my iPhone
On 19 Sep 2014, at 23:35, jw.cmu jinliangw...@gmail.com wrote:
I'm trying to run Spark ALS using the netflix
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really, really slow if you're going to move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
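A minimal sketch of turning Kryo on via the SparkConf (the buffer size is just an example value):

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")  // example value; raise it for large objects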
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen so...@cloudera.com