Hi Stephen,
I am using the Maven shade plugin to create my uber jar. I have marked the
Spark dependencies as provided.
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Mon, May 12, 2014 at 1:04 AM, Stephen Boesch java...@gmail.com wrote:
HI
I found that if a file is present at the given path on the local FS of all
the nodes, then reading is possible.
But is there a way to read a file that is present only on certain nodes?
[There should be a way!]
NEED: I want to do some filter ops on an HDFS file, create a local file of
the result,
Sure, I uploaded the code on pastebin: http://pastebin.com/90Hynrjh
On Mon, May 12, 2014 at 12:27 AM, Madhu ma...@madhu.com wrote:
No, you don't need to do anything special to get it to run in Eclipse.
Just add the assembly jar to the build path, create a main method, add your
code, and click
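(A minimal sketch of such a driver, with made-up names; local[2] keeps
everything in-process, so no cluster is needed:)

    import org.apache.spark.SparkContext

    object EclipseDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "eclipse-demo")
        val total = sc.parallelize(1 to 100).reduce(_ + _)
        println("sum = " + total)
        sc.stop()
      }
    }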
@Sonal - makes sense. Is the Maven shade plugin runnable within sbt? If so,
would you care to share those build.sbt (or .scala) lines? If not, are you
aware of a similar plugin for sbt?
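(For reference, a minimal sketch of the sbt analogue: the sbt-assembly
plugin plays the role of shade, and Spark stays out of the uber jar via the
provided scope, as Sonal describes. Versions are assumptions.)

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt -- keep Spark out of the assembled jar
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"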
2014-05-11 23:53 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com:
Hi Stephen,
I am using maven shade
On Wed, May 7, 2014 at 4:00 AM, Han JU ju.han.fe...@gmail.com wrote:
But in my experience, when reading directly from s3n, Spark creates only one
input partition per file, regardless of the file size. This may lead to
performance problems if you have big files.
You can (and perhaps should)
Yes, Spark goes through the standard HDFS client and will automatically benefit
from this.
Matei
On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi chan...@gmail.com wrote:
Hi all,
Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
sc.textFile() and other HDFS-related APIs?
There was never a Hadoop 2.0.0. There was a Hadoop 2.0.0-alpha as
far as Maven artifacts are concerned. The latest in that series is
2.0.6-alpha.
On Mon, May 12, 2014 at 4:29 AM, Sophia sln-1...@163.com wrote:
I have built Shark the sbt way, but this sbt exception turns up:
[error]
I'm trying to run spark-shell on Hadoop yarn.
Specifically, the environment is as follows:
- Client
- OS: Windows 7
- Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8)
- Server
- Platform: Hortonworks Sandbox 2.1
I modified the spark code to apply
Hi All,
I wanted to launch Spark on YARN interactively, in yarn-client mode.
With the default settings of yarn-site.xml and spark-env.sh, I followed the
given link:
http://spark.apache.org/docs/0.8.1/running-on-yarn.html
I get the correct pi value when I run without launching the shell.
When I launch
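(For reference, a minimal sketch of launching the shell in yarn-client mode
on a 1.0-era build; the config path is an assumption:)

    export HADOOP_CONF_DIR=/etc/hadoop/conf   # must point at the cluster's config
    MASTER=yarn-client ./bin/spark-shell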
Note the mails are coming out of order in some cases. I am getting current
messages but a sprinkling of old replies too.
On May 12, 2014 12:16 PM, ankurdave ankurd...@gmail.com wrote:
I haven't been getting mail either. This was the last message I received:
Hi,
I set up a small cluster with 3 machines, each with 64 GB RAM and 11 cores,
and I am using Spark 0.9.
I have set spark-env.sh as follows:
SPARK_MASTER_IP=192.168.35.2
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=12306
SPARK_WORKER_CORES=3
Hey guys,
I've asked before (in Spark 0.9; I now use 0.9.1) about removing the log4j
dependency and was told that it was gone. However, I still find it pulled in
through the zookeeper dependency. This is fine, since I exclude it myself in
the sbt file, but another issue arises.
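(A sketch of that sbt-side exclusion, for reference; the Spark version is an
assumption:)

    // build.sbt -- drop the transitive log4j dependency
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" excludeAll ExclusionRule(organization = "log4j")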
I wonder if anyone else has run into
Right now I am not using any class variables (references to this). All my
variables are created within the scope of the method I am running.
I did more debugging and found this strange behavior.
variables here
for loop
  mapPartitions call
    use variables here
  end mapPartitions
endfor
Ah, yes, that is correct. You need a serializable object one way or the
other.
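(A minimal Scala sketch of that loop shape, with made-up names. The point is
that whatever the mapPartitions closure captures -- here `lookup` -- gets
serialized and shipped to the workers, so it must be serializable:)

    import org.apache.spark.rdd.RDD

    def run(data: RDD[String]): Unit = {
      val lookup = Map("a" -> 1, "b" -> 2) // method-local, but still serialized
      for (i <- 1 to 3) {
        val hits = data.mapPartitions { iter =>
          iter.map(s => lookup.getOrElse(s, 0)) // uses the captured variable
        }
        println("pass " + i + ": " + hits.count() + " elements")
      }
    }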
An alternate suggestion would be to use a combination of RDD.sample()
(http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#sample)
and collect() to take a look at some small amount of data and just
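(A Scala sketch of the same idea as the linked PySpark call; the fraction
and seed are arbitrary, and the input path is made up:)

    val rdd = sc.textFile("hdfs:///data")
    val peek = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42).collect()
    peek.foreach(println)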
Hi all,
I'm currently trying to use pipe to run C++ code on each worker node, and I
have an RDD of essentially command-line arguments that I'm passing to each
node. I want to send exactly one element to each node, but when I run my
code, Spark ends up sending multiple elements to a node: is there
Fixed the problem as soon as I sent this out, sigh. Apparently you can do
this by changing the number of slices to cut the dataset into: I thought
that was identical to the number of partitions, but apparently not.
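(A sketch of that fix: one slice per element, so pipe() hands exactly one
element to each task. The arguments and binary path are made up:)

    val cmdArgs = Seq("--input a.dat", "--input b.dat", "--input c.dat")
    val onePerTask = sc.parallelize(cmdArgs, cmdArgs.size) // one slice per element
    val results = onePerTask.pipe("./worker_binary").collect()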
Those are warning messages, not errors. You need to add
netlib-java:all to use native BLAS/LAPACK. But it won't work if you
include netlib-java:all in an assembly jar. It has to be a separate
jar when you submit your job. For SGD, we only use level-1 BLAS, so I
don't think native code is
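(A sketch of keeping netlib-java out of the assembly; the coordinates and
version are assumptions, and pomOnly() is used because netlib-java:all is
published as a pom artifact:)

    // build.sbt
    libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
    // then ship the jar separately when submitting, e.g. via spark-submit --jars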
You can just pass it around as a parameter.
On May 12, 2014, at 12:37 PM, yh18190 yh18...@gmail.com wrote:
Hi,
Could anyone suggest an idea for how we can use the SparkContext object in
other classes or functions where we need to convert a Scala collection to an
RDD using the sc object, like
Hi,
Could anyone suggest an idea for how we can use the SparkContext object in
other classes or functions where we need to convert a Scala collection to an
RDD using the sc object, like sc.makeRDD(list), instead of using the main
class's SparkContext object?
Is there a way to pass the sc object as a parameter to
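(A minimal sketch of passing it around as a parameter; the class and method
names are made up:)

    import org.apache.spark.SparkContext

    class CollectionConverter(sc: SparkContext) {
      def toRDD(list: List[Int]) = sc.makeRDD(list)
    }

    // in the main class:
    // val converter = new CollectionConverter(sc)
    // val rdd = converter.toRDD(List(1, 2, 3))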
Dear Sparkers:
I am using the Python API of Spark 0.9.0 to implement an iterative
algorithm. I got the errors shown at the end of this email. It seems to be
due to a Java stack overflow error. The same error has been duplicated on a
Mac desktop and a Linux workstation, both running the
You mean you normally get an RDD, right?
A DStream is a sequence of RDDs.
It kind of depends on what you are trying to accomplish here.
Sum/count for each RDD in the stream?
On Wed, May 7, 2014 at 6:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Hi,
I use the following code for calculating
Hi,
I use the following code for calculating the average. The problem is that
the reduce operation returns a DStream here, and not a tuple as it normally
does without streaming. So how can we get the sum and the count from the
DStream? Can we cast it to a tuple?
val numbers =
A few more data points: my current theory is now that spark's piping
mechanism is considerably slower than just running the C++ app directly on
the node.
I ran the C++ application directly on a node in the cluster, and timed the
execution of various parts of the program, and got ~10 seconds to
I was able to compile your code in Eclipse.
I ran it using the data in your comments, but I also see the
NoSuchMethodError you mentioned.
It seems to run fine until the call to calculateZVector(...)
It appears that org.apache.commons.math3.util.Pair is not Serializable, so
that's one potential
It sounds like you are doing everything right.
NoSuchMethodError suggests it's finding log4j, just not the right
version. That method is definitely in 1.2; it might have been removed
in 2.x? (http://logging.apache.org/log4j/2.x/manual/migration.html)
So I wonder if something is sneaking in log4j
I have been experimenting with a data set with and without persisting the RDD
and have come across some unexpected results. The files we are reading are
Avro files, so we are using the following to define the RDD; what we end up
with is an RDD[CleansedLogFormat]:
val f = new NewHadoopRDD(sc,
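(For context, a sketch of one way to get such an RDD through the new Hadoop
API; the path is made up and the CleansedLogFormat factory is hypothetical:)

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val raw = sc.newAPIHadoopFile(
      "hdfs:///logs/*.avro",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])
    val f = raw.map { case (key, _) => CleansedLogFormat(key.datum()) }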
Is that true? I believe that API Chanwit is talking about requires
explicitly asking for files to be cached in HDFS.
Spark automatically benefits from the kernel's page cache (i.e. if
some block is in the kernel's page cache, it will be read more
quickly). But the explicit HDFS cache is a
This gives the dependency tree in sbt (Spark uses sbt).
https://github.com/jrudolph/sbt-dependency-graph
TD
On Mon, May 12, 2014 at 4:55 PM, Sean Owen so...@cloudera.com wrote:
It sounds like you are doing everything right.
NoSuchMethodError suggests it's finding log4j, just not the right
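(A sketch of enabling the plugin; the version is an assumption:)

    // project/plugins.sbt
    addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.4")
    // then run: sbt dependency-tree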
A very crucial thing to remember when using file stream is that the files
must be written to the monitored directory atomically. That is, once the
file system shows the file in its listing, the file should not be appended
to or updated. Violating this often causes this kind of issue, as Spark
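(A sketch of the atomic hand-off; paths and payload are made up. Rename is
atomic within a filesystem, so the monitored directory never shows a
half-written file:)

    import java.nio.file.{Files, Paths, StandardCopyOption}

    val payload = "some record\n".getBytes("UTF-8") // stand-in for real data
    val tmp = Paths.get("/data/_staging/part-0001") // same filesystem as target
    Files.write(tmp, payload)
    Files.move(tmp, Paths.get("/data/monitored/part-0001"),
      StandardCopyOption.ATOMIC_MOVE)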
Hi Joe,
Your messages are going into spam folder for me.
Thx, Archit_Thakur.
On Fri, May 2, 2014 at 9:22 AM, Joe L selme...@yahoo.com wrote:
Hi, you should include the jar file of your project, for example:
conf.setJars(Seq("/path/to/yourjarfile.jar"))
Joe
On Friday, May 2, 2014 7:39 AM, proofmoore
One way to ensure Spark writes more partitions is by using
RDD#repartition() to make each partition smaller. One Spark partition
always corresponds to one file in the underlying store, and it's usually a
good idea to have each partition size range somewhere between 64 MB to 256
MB. Too few
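(A sketch of that sizing rule, targeting ~128 MB per partition; the paths
and input size are made-up figures:)

    val rdd = sc.textFile("hdfs:///in")
    val inputBytes = 50L * 1024 * 1024 * 1024   // suppose ~50 GB of input
    val targetBytes = 128L * 1024 * 1024        // middle of the 64-256 MB range
    val numPartitions = math.max(1, (inputBytes / targetBytes).toInt)
    rdd.repartition(numPartitions).saveAsTextFile("hdfs:///out")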
It seems that the code isn't managed on GitHub. It can be downloaded from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/spark-liblinear-1.94.zip
It would be easier to track the changes on GitHub.
Sincerely,
DB Tsai
I've discovered that it was noticed a year ago that RDD zip() does not work
when the number of partitions does not evenly divide the total number of
elements in the RDD:
https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ
I will enter a JIRA ticket just as soon as the
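(A small sketch that reproduces the pitfall: equal element and partition
counts, but the elements land in partitions unevenly in different ways; the
per-partition sizes shown are illustrative:)

    val a = sc.parallelize(1 to 10, 4)             // e.g. sizes 2, 3, 2, 3
    val b = sc.parallelize(1 to 10, 5).coalesce(4) // e.g. sizes 2, 2, 2, 4
    a.zip(a.map(_ * 2)).count() // fine: identical partitioning
    a.zip(b).count()            // throws at runtime: unequal elements per partition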
Hi
Why do I always encounter remoting errors:
akka.remote.RemoteTransportException and
java.util.concurrent.TimeoutException?
Best Regards,
Thanks, Aaron, this looks like a good solution! Will be trying it out shortly.
I noticed that the S3 exception seem to occur more frequently when the
box is swapping. Why is the box swapping? combineByKey seems to make
the assumption that it can fit an entire partition in memory when
doing the
Use DStream.foreachRDD to do an operation on the final RDD of every batch.
val sumandcount = numbers.map(n => (n.toDouble, 1)).reduce { (a, b) =>
  (a._1 + b._1, a._2 + b._2) }
sumandcount.foreachRDD { rdd => val first: (Double, Int) = rdd.take(1)(0);
... }
DStream.reduce creates a DStream whose RDDs
Since you are using the latest Spark code and not Spark 0.9.1 (guessed from
the log messages), you can actually do graceful shutdown of a streaming
context. This ensures that the receivers are properly stopped and all
received data is processed and then the system terminates (stop() stays
blocked
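(A sketch of that call; `ssc` is the StreamingContext, and the two-argument
stop() is assumed from the post-0.9.1 API described above:)

    // stop receivers first, let all received data be processed, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = true)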
Jacob Gerard -
You might find the link below useful:
http://rrati.github.io/blog/2014/05/07/apache-hadoop-plus-docker-plus-fedora-running-images/
For non-reverse-dns apps, NAT is your friend.
Cheers,
Tim
- Original Message -
From: Jacob Eisinger jeis...@us.ibm.com
To:
Hey Jim, unfortunately external spilling is not implemented in Python right
now. While it would be possible to update combineByKey to do smarter stuff
here, one simple workaround you can try is to launch more map tasks (or more
reduce tasks). To set the minimum number of map tasks, you can pass
Hi, Adrian --
If my memory serves, you need 1.7.7 of the various slf4j modules to avoid
that issue.
Best.
-- Paul
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
On Mon, May 12, 2014 at 7:51 AM, Adrian Mocanu amoc...@verticalscope.comwrote:
Hey guys,
I've asked
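(A sketch of pinning the slf4j modules to 1.7.7, per Paul's note; the exact
module list depends on your build:)

    // build.sbt
    libraryDependencies ++= Seq(
      "org.slf4j" % "slf4j-api" % "1.7.7",
      "org.slf4j" % "slf4j-log4j12" % "1.7.7"
    )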