Thanks Andrew, that helps.
On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List] <
ml-node+s1001560n14708...@n3.nabble.com> wrote:
> Hey just a minor clarification, you _can_ use SparkFiles.get in your
> application only if it runs on the executors, e.g. in the following way:
>
Hi,
I'm developing an application with spark-streaming-kafka, which
depends on spark-streaming and kafka. Since spark-streaming is
provided at runtime, I want to exclude its jars from the assembly. I
tried the following configuration:
libraryDependencies ++= {
val sparkVersion = "1.0.2"
Seq(
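A common way to do this (a sketch, assuming sbt-assembly and the artifact names of that Spark era) is to mark the Spark artifacts "provided" so they compile but are left out of the fat jar:

```scala
// build.sbt sketch -- assumes sbt-assembly; names/versions are from Spark 1.0.2
libraryDependencies ++= {
  val sparkVersion = "1.0.2"
  Seq(
    // "provided": on the compile classpath, excluded from the assembly jar
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    // the Kafka connector still needs to ship inside the jar
    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
  )
}
```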
So sorry about teasing you with the Scala. But the method is there in Java
too, I just checked.
On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen wrote:
> It might not be the same as a real hadoop reducer, but I think it would
> accomplish the same. Take a look at:
>
> import org.apache.spark.
Hi,
I'm using Spark Streaming 1.0.
Say I have a source of website click stream, like the following:
('2014-09-19 00:00:00', '192.168.1.1', 'home_page')
('2014-09-19 00:00:01', '192.168.1.2', 'list_page')
...
And I want to calculate the page views (PV, number of logs) and unique
user (UV, identi
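For what it's worth, a minimal sketch of per-batch PV/UV (the DStream name `clicks` and the tuple layout `(time, ip, page)` are assumptions from the sample lines above):

```scala
// PV: number of log lines in each batch
val pv = clicks.count()

// UV: distinct IPs in each batch -- transform() lets us call distinct()
// on the underlying RDD of every batch
val uv = clicks.map(_._2).transform(_.distinct()).count()
```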
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really slow if you're going to move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen wrote:
> Base 64 i
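Enabling Kryo is one line of SparkConf (sketch; in this Spark version you can additionally point spark.kryo.registrator at a custom KryoRegistrator to register your classes):

```scala
import org.apache.spark.SparkConf

// switch the serializer from the default Java serialization to Kryo
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```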
Have you set spark.local.dir (I think this is the config setting)?
It needs to point to a volume with plenty of space.
By default, if I recall correctly, it points to /tmp.
Sent from my iPhone
> On 19 Sep 2014, at 23:35, "jw.cmu" wrote:
>
> I'm trying to run Spark ALS using the netflix dataset but failed d
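A sketch of pointing spark.local.dir (where shuffle spill files land) at a bigger volume; the mount path here is an assumption:

```scala
import org.apache.spark.SparkConf

// spark.local.dir holds shuffle/spill scratch files; default is /tmp
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp") // hypothetical mount
```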
I successfully did this once.
RDD map to RDD[(ImmutableBytesWritable, KeyValue)]
then
val conf = HBaseConfiguration.create()
val job = new Job (conf, "CEF2HFile")
job.setMapOutputKeyClass (classOf[ImmutableBytesWritable]);
job.setMapOutputValueClass (classOf[KeyValue]);
val table = new HTable(con
Hi Evan,
here is an improved version, thanks for your advice. But you know, the last
step, the saveAsTextFile, is very slow :(
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.net.URL
import java.text.SimpleDateFormat
import c
Hi,
I am using the latest release Spark 1.1.0. I am trying to build the
streaming examples (under examples/streaming) as a standalone project with
the following streaming.sbt file. When I run sbt assembly, I get an error
stating that object algebird is not a member of package com.twitter. I
tried
Hey just a minor clarification, you _can_ use SparkFiles.get in your
application only if it runs on the executors, e.g. in the following way:
sc.parallelize(1 to 100).map { i => SparkFiles.get("my.file") }.collect()
But not in general (otherwise NPE, as in your case). Perhaps this should be
docum
Your proposed use of rdd.pipe("foo") to communicate with an external
process seems fine. The "foo" program should read its input from
stdin, perform its computations, and write its results back to stdout.
Note that "foo" will be run on the workers, invoked once per
partition, and the result will be
I have created JIRA ticket 3610 for the issue. thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-load-app-logs-for-MLLib-programs-in-history-server-tp14627p14706.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
Hi Jey
Many thanks for the code example. Here is what I really want to do. I want
to use Spark Stream and python. Unfortunately pySpark does not support
streams yet. It was suggested the way to work around this was to use an RDD
pipe. The example below was a little experiment.
You can think of m
I'm trying to run Spark ALS using the netflix dataset but failed due to "No
space on device" exception. It seems the exception is thrown after the
training phase. It's not clear to me what is being written and where the
output directory is.
I was able to run the same code on the provided test.data
Please see http://hbase.apache.org/book.html#completebulkload
LoadIncrementalHFiles has a main() method.
On Fri, Sep 19, 2014 at 5:41 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> Agreed that the bulk import would be faster. In my case, I wasn't
> expecting a lot of data to be upl
It might not be the same as a real hadoop reducer, but I think it would
accomplish the same. Take a look at:
import org.apache.spark.SparkContext._
// val rdd: RDD[(K, V)]
// def zero(value: V): S
// def reduce(agg: S, value: V): S
// def merge(agg1: S, agg2: S): S
val reducedUnsorted: RDD[(K, S)]
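Filling in the elided call as a sketch (concrete types are my assumption): those three functions are exactly combineByKey's arguments, which is what makes it play the role of a Hadoop reducer.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// assumes a SparkContext named sc, e.g. in the spark shell
val rdd: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

val reducedUnsorted: RDD[(String, Int)] = rdd.combineByKey(
  (value: Int) => value,                  // zero: seed the aggregate
  (agg: Int, value: Int) => agg + value,  // reduce: fold a value in (per partition)
  (agg1: Int, agg2: Int) => agg1 + agg2   // merge: combine partial aggregates
)
```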
How many cores do you have in your boxes?
looks like you are assigning 32 cores "per" executor - is that what you want?
are there other applications running on the cluster? you might want to check
YARN UI to see how many containers are getting allocated to your application.
On Sep 19, 2014, a
I'm launching a Spark shell with the following parameters
./spark-shell --master yarn-client --executor-memory 32g --driver-memory 4g
--executor-cores 32 --num-executors 8
but when I look at the Spark UI it shows only 209.3 GB total memory.
Executors (10)
- *Memory:* 55.9 GB Used (209.3 GB
I am struggling to reproduce the functionality of a Hadoop reducer on
Spark (in Java)
in Hadoop I have a function
public void doReduce(K key, Iterator values)
in Hadoop there is also a consumer (context write) which can be seen as
consume(key,value)
In my code
1) knowing the key is important to
Hi Andy,
That's a feature -- you'll have to print out the return value from
collect() if you want the contents to show up on stdout.
Probably something like this:
for(Iterator iter = rdd.pipe(pwd +
"/src/main/bin/RDDPipe.sh").collect().iterator(); iter.hasNext();)
System.out.println(iter.next
What is in 'rdd' here, to double check? Do you mean the spark shell when
you say console? At the end you're grepping some redirected output, but
where is that from?
On Sep 19, 2014 7:21 PM, "Andy Davidson"
wrote:
> Hi
>
> I wrote a little Java job to try and figure out how RDD pipe
Hi,
I have a program similar to the BinaryClassifier example that I am running
using my data (which is fairly small). I run this for 100 iterations. I
observed the following performance:
Standalone mode cluster with 10 nodes (with Spark 1.0.2): 5 minutes
Standalone mode cluster with 10 nodes (wi
Hi
I wrote a little Java job to try and figure out how RDD pipe works.
Below is my test shell script. If I turn on debugging in the script, I see
output in my console. If debugging is turned off in the shell script, I do
not see anything in my console. Is this a bug or a feature?
I am running t
I think it's normal.
On Fri, Sep 19, 2014 at 12:07 AM, Luis Guerra wrote:
> Hello everyone,
>
> What should be the normal time difference between Scala and Python using
> Spark? I mean running the same program in the same cluster environment.
>
> In my case I am using numpy array structures for t
Hi Chinchu,
SparkEnv is an internal class that is only meant to be used within Spark.
Outside of Spark, it will be null because there are no executors or driver
to start an environment for. Similarly, SparkFiles is meant to be used
internally (though its privacy settings should be modified to ref
Hi, Spark experts,
I have the following issue when using aws java sdk in my spark application.
Here I narrowed down the following steps to reproduce the problem
1) I have Spark 1.1.0 with hadoop 2.4 installed on 3 nodes cluster
2) from the master node, I did the following steps.
spark-shell --
Hi,
I am working with the SVMWithSGD classification algorithm on Spark. It
works fine for me, however, I would like to recognize the instances that
are classified with a high confidence from those with a low one. How do we
define the threshold here? Ultimately, I want to keep only those for which
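One way to get at confidence (a sketch, not necessarily the list's recommended answer): clear the model's threshold so predict() returns the raw margin instead of a 0/1 label, then filter on its magnitude. The variable names and the 1.0 cutoff are assumptions.

```scala
// model: a trained SVMModel; testData: RDD[LabeledPoint]
// after clearThreshold(), predict() returns the raw decision margin
model.clearThreshold()

val scored = testData.map(p => (p.label, model.predict(p.features)))

// keep only confident predictions; the cutoff is an arbitrary placeholder
val confident = scored.filter { case (_, margin) => math.abs(margin) > 1.0 }
```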
Excellent - that's exactly what I needed. I saw iterator() but missed the
toLocalIterator() method.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/paging-through-an-RDD-that-s-too-large-to-collect-all-at-once-tp14638p14686.html
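For reference, the paging pattern with toLocalIterator looks roughly like this (the page size is an assumption):

```scala
// streams partitions to the driver one at a time instead of collect()-ing
// the whole RDD; grouped() then carves the stream into fixed-size pages
rdd.toLocalIterator.grouped(1000).foreach { page =>
  page.foreach(println) // process one page of up to 1000 elements
}
```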
It turns out that it was the Hadoop version that was the issue.
spark-1.0.2-hadoop1 and spark-1.1.0-hadoop1 both work.
spark.1.0.2-hadoop2, spark-1.1.0-hadoop2.4 and spark-1.1.0-hadoop2.4 do not
work.
It's strange because for this little test I am not even using HDFS at all.
-- Eric
On Thu,
Thank you Soumya Simanta and Tobias. I've deleted the contents of the work
folder in all the nodes.
Now it's working perfectly as it was before.
Thank you
Karthik
On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta
wrote:
> One possible reason is maybe that the checkpointing directory
> $SPARK_HOME
onStart should be non-blocking. You may try to create a thread in onStart
instead.
- Original Message -
From: "t1ny"
To: u...@spark.incubator.apache.org
Sent: Friday, September 19, 2014 1:26:42 AM
Subject: Re: Spark Streaming and ReactiveMongo
Here's what we've tried so far as a first e
Thanks, Shivaram.
Kui
> On Sep 19, 2014, at 12:58 AM, Shivaram Venkataraman
> wrote:
>
> As R is single-threaded, SparkR launches one R process per-executor on
> the worker side.
>
> Thanks
> Shivaram
>
> On Thu, Sep 18, 2014 at 7:49 AM, oppokui wrote:
>> Shivaram,
>>
>> As I know, SparkR
I'm running out of options trying to integrate cassandra, spark, and the
spark-cassandra-connector.
I quickly found out just grabbing the latest versions of everything
(drivers, etc.) doesn't work--binary incompatibilities it would seem.
So last I tried using versions of drivers from the
spark-ca
Jatin,
If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng wrote:
> Hi Jatin,
>
> HashingTF should be able to solve the memory problem if you use a
> small feature dimension in HashingTF. Please do
Hi Mohan,
It’s a bit convoluted to follow in their source, but they essentially typedef
KSerializer as being a KryoSerializer, and then their serializers all extend
KSerializer. Spark should identify them properly as Kryo Serializers, but I
haven’t tried it myself.
Regards,
Frank Austin Notha
Agreed that the bulk import would be faster. In my case, I wasn't expecting
a lot of data to be uploaded to HBase and also, I didn't want to take the
pain of importing generated HFiles into HBase. Is there a way to invoke
HBase HFile import batch script programmatically?
On 19 September 2014 17:58
Apologies for the delay in getting back on this. It seems the Kinesis example
does not run on Spark 1.1.0 even when it is built using the kinesis-asl profile
because of a dependency conflict in http client (same issue as
http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5j
In fact, it seems that Put can be used by HFileOutputFormat, so Put object
itself may not be the problem.
The problem is that TableOutputFormat uses the Put object in the normal way
(which goes through the normal write path), while HFileOutputFormat uses it to
directly build the HFile.
From: innowi
Thank you for the example code.
Currently I use foreachPartition() + Put(), but your example code can be used
to clean up my code.
BTW, since the data uploaded by Put() goes through the normal HBase write path,
it can be slow.
So, it would be nice if bulk-load could be used, since it bypasse
I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat
instead of HFileOutputFormat. But, hopefully this should help you:
val hbaseZookeeperQuorum =
s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", hbase
On 09/19/2014 05:06 AM, Sean Owen wrote:
No, it is actually a quite different 'alpha' project under the same
name: linear algebra DSL on top of H2O and also Spark. It is not really
about algorithm implementations now.
On Sep 19, 2014 1:25 AM, "Matthew Farrellee" <m...@redhat.com> wrote:
Hi,
Sorry, I just found saveAsNewAPIHadoopDataset.
Then, Can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there
any example code for that?
Thanks.
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, September 19, 2014 8:18 PM
To: user@spark
Hi,
After reading several documents, it seems that saveAsHadoopDataset cannot
use HFileOutputFormat.
It's because saveAsHadoopDataset method uses JobConf, so it belongs to the
old Hadoop API, while HFileOutputFormat is a member of mapreduce package
which is for the new Hadoop API.
Am I rig
One possible reason is maybe that the checkpointing directory
$SPARK_HOME/work is rsynced as well.
Try emptying the contents of the work folder on each node and try again.
On Fri, Sep 19, 2014 at 4:53 AM, rapelly kartheek
wrote:
> I
> * followed this command:rsync -avL --progress path/to/spark
Hi,
Is there a way to bulk-load to HBase from RDD?
HBase offers HFileOutputFormat class for bulk loading by MapReduce job, but
I cannot figure out how to use it with saveAsHadoopDataset.
Thanks.
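A hedged sketch of the HFile route discussed elsewhere in this thread (new-API saveAsNewAPIHadoopFile with HFileOutputFormat; the table name and output path are assumptions):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.mapreduce.Job

// kvRdd: RDD[(ImmutableBytesWritable, KeyValue)], sorted by row key
val conf = HBaseConfiguration.create()
val job = new Job(conf, "bulkload")

// pulls region boundaries / compression settings from the target table
HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"))

kvRdd.saveAsNewAPIHadoopFile("/tmp/hfiles", // hypothetical HDFS path
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat], job.getConfiguration)

// then load the generated HFiles, e.g.:
// hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable
```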
The product of each mapPartitions call can be an Iterable of one big Map.
You still need to write some extra custom code like what lookup() does to
exploit this data structure.
On Sep 18, 2014 11:07 PM, "Harsha HN" <99harsha.h@gmail.com> wrote:
> Hi All,
>
> My question is related to improving
No, it is actually a quite different 'alpha' project under the same name:
linear algebra DSL on top of H2O and also Spark. It is not really about
algorithm implementations now.
On Sep 19, 2014 1:25 AM, "Matthew Farrellee" wrote:
> On 09/18/2014 05:40 PM, Sean Owen wrote:
>
>> No, the architecture
I followed this command: rsync -avL --progress path/to/spark-1.0.0
username@destinationhostname:path/to/destdirectory. Anyway, for now, I did it
individually for each node.
I have copied to each node at a time individually using the above command.
So, I guess the copying may not contain any
Hi,
On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek
wrote:
> > ,
>
> you have copied a lot of files from various hosts to
> username@slave3:path
> only from one node to all the other nodes...
>
I don't think rsync can do that in one command as you described. My guess
is that now you have a
-- Forwarded message --
From: rapelly kartheek
Date: Fri, Sep 19, 2014 at 1:51 PM
Subject: Re: rsync problem
To: Tobias Pfeiffer
any idea why the cluster is dying down???
On Fri, Sep 19, 2014 at 1:47 PM, rapelly kartheek
wrote:
> ,
>
>
> * you have copied a lot of files from
Here's what we've tried so far as a first example of a custom Mongo receiver
:
class MongoStreamReceiver(host: String)
extends NetworkReceiver[String] {
protected lazy val blocksGenerator: BlockGenerator =
new BlockGenerator(StorageLevel.MEMORY_AND_DISK_SER_2)
protected def onStart()
,
you have copied a lot of files from various hosts to username@slave3:path
only from one node to all the other nodes...
On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek
wrote:
> Hi Tobias,
>
> I've copied the files from master to all the slaves.
>
> On Fri, Sep 19, 2014 at 1:37 PM, Tobias
Hi Tobias,
I've copied the files from master to all the slaves.
On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer wrote:
> Hi,
>
> On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek > wrote:
>>
>> This worked perfectly. But, I wanted to simultaneously rsync all the
>> slaves. So, added the other
Hi,
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek
wrote:
>
> This worked perfectly. But, I wanted to simultaneously rsync all the
> slaves. So, added the other slaves as following:
>
> rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname
> :path/to/destdirectory username@sl
Hi,
I'd made some modifications to the spark source code in the master and
reflected them to the slaves using rsync.
I followed this command:
rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname
:path/to/destdirectory.
This worked perfectly. But, I wanted to simultaneously rs
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for
Function serde :(
On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote:
> Well it looks like this is indeed a protobuf issue. Poked a little more
> with Kryo. Since protobuf messages are serializable, I tried just making
> Kryo
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages. The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
classloade
Hey, I don't think that's the issue; foreach is called on 'results', which is
a DStream of floats, so naturally it passes RDDs to its function.
And either way, changing the code in the first mapper to comment out the map
reduce process on the RDD
Float f = 1.0f; //nnRdd.map(new Function() {
Hello everybody,
I'm new to spark streaming and played a bit around with WordCount and a
PageRank-Algorithm in a cluster-environment.
Am I right that in the cluster each executor computes the data stream
separately? And that the result of each executor is independent of the other
executors?
In the
Hello!
Could you please add us to your "powered by" page?
Project name: Ubix.io
Link: http://ubix.io
Components: Spark, Shark, Spark SQL, MLib, GraphX, Spark Streaming, Adam
project
Description:
Hello everyone,
What should be the normal time difference between Scala and Python using
Spark? I mean running the same program in the same cluster environment.
In my case I am using numpy array structures for the Python code and
vectors for the Scala code, both for handling my data. The time dif
Thanks for the info, Frank.
Twitter chill's Avro serializer looks great.
But how does Spark identify it as a serializer, as it's not extending
KryoSerializer?
(Sorry, Scala is an alien language for me.)
-
Thanks & Regards,
Mohan