Thanks, that was what I was missing!
Arun Swami
+1 408-338-0906
On Fri, May 2, 2014 at 4:28 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
You need to first partition the data by the key, then use mapPartitions instead of map.
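Roughly, the pattern being suggested looks like this (a sketch with illustrative names and data, not code from the thread):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

val sc = new SparkContext(new SparkConf().setAppName("PartitionThenMapPartitions").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// 1. Partition by key so all records sharing a key land in the same partition.
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// 2. mapPartitions lets you do per-partition work (a buffer, a connection)
//    once per partition instead of once per record.
val sums = partitioned.mapPartitions { iter =>
  val acc = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  iter.foreach { case (k, v) => acc(k) += v }
  acc.iterator
}
sums.collect().foreach(println)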
Mayur Rustagi
Ph: +1 (760) 203 3257
On Sat, May 3, 2014 at 2:00 AM, SK skrishna...@gmail.com wrote:
1) I have a csv file where one of the fields has integer data, but it appears
as strings: 1, 3, etc. I tried using toInt to convert the
strings to Int after reading (field(3).toInt), but I got a
NumberFormatException. So I
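For what it's worth, a common cause of that exception is stray whitespace or quote characters around the digits; that is only a guess, since the offending value isn't shown. A small sketch of parsing the field defensively:

// Defensive integer parsing for a CSV field (the cleaning step is an assumption).
def parseIntField(raw: String): Option[Int] = {
  val cleaned = raw.trim.stripPrefix("\"").stripSuffix("\"")
  try Some(cleaned.toInt)
  catch { case _: NumberFormatException => None }
}

// Usage: keep only rows whose field 3 parses, e.g.
// lines.flatMap(line => parseIntField(line.split(",")(3)))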
Hi, thanks for all your help.
I tried your setting in the sbt file, but the problem is still there.
The Java setting in my sbt file is:
java \
-Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
-jar ${JAR} \
$@
I have tried to set these 3 parameters bigger and smaller, but
Hi TD, thanks for the reply.
I ran this last experiment with one computer, in local mode, but I think that the time
gap will grow when I add more computers. I will run it again now with 8 workers and 1
word source and see what happens. I will also check the timing, as
suggested by Andrew.
On May 3, 2014,
In the Spark Kafka example, it says
`./bin/run-example org.apache.spark.streaming.examples.KafkaWordCount
local[2] zoo01,zoo02,zoo03 my-consumer-group topic1,topic2 1`
Can anyone tell me what local[2] represents? I thought the master URL
should be something like spark://hostname:port.
Also,
Poking around in the bowels of Scala, it seems like this has something to
do with implicit Scala/Java collection conversion. Why would it be doing
this, and where? The stack trace given is entirely unhelpful to me. Is there
a better one buried in my task logs? None of my tasks actually failed, so
it
Hi Weide,
The answer to your first question about local[2] can be found in the
Running the Examples and Shell section of
https://spark.apache.org/docs/latest/
Note that all of the sample programs take a master parameter specifying
the cluster URL to connect to. This can be a URL for a
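For reference, a minimal sketch of how that master parameter is used in code; "local[2]" runs Spark in-process with 2 worker threads, while a standalone cluster would use a spark:// URL (the host and port below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KafkaWordCount")
  .setMaster("local[2]")        // or "spark://hostname:7077" for a cluster
val sc = new SparkContext(conf)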
Hello, I am getting this warning after upgrading to Hadoop 2.4, when I try to
write something to HDFS. The content is written correctly, but I do
not like this warning.
Do I have to compile Spark against Hadoop 2.4?
WARN TaskSetManager: Loss was due to org.apache.hadoop.ipc.RemoteException
Hi all,
I compiled Spark-1.0.0-rc3 against Hadoop 2.3.0 with
SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly
and deployed Spark with HDFS 2.3.0 (yarn = false). It seems that it cannot
read data from HDFS; the following exception is thrown:
java.lang.VerifyError: class
The problem is probably not with the JVM running sbt but with the one that
sbt is forking to run your program.
See here for the relevant option:
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L186
You might try starting sbt with no arguments (to bring up the sbt console).
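As a rough sketch (the exact settings in Spark's own build may differ; see the linked line), the forked JVM's memory is controlled with something like this in a build.sbt or from the sbt console:

fork := true                 // run the program in a forked JVM
javaOptions ++= Seq(
  "-Xmx2g",                  // heap for the forked JVM, not for sbt itself
  "-XX:MaxPermSize=350m"
)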
If you want to use true Spark Streaming (not the same as Hadoop
Streaming/Piping, as Mayur pointed out), you can use the DStream.union()
method as described in the following docs:
http://spark.apache.org/docs/0.9.1/streaming-custom-receivers.html
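A minimal sketch of union-ing several receiver streams into one DStream (the socket source and port numbers are only placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("UnionExample").setMaster("local[4]"), Seconds(2))

// Three input streams, combined into one before processing.
val streams = (1 to 3).map(i => ssc.socketTextStream("localhost", 9000 + i))
val unioned = streams.reduce(_ union _)   // or ssc.union(streams)

unioned.count().print()
ssc.start()
ssc.awaitTermination()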
Not sure if this directly addresses your issue, Peter, but it's worth
mentioning a handy AWS EMR utility called s3distcp that can upload a single
HDFS file - in parallel - to a single, concatenated S3 file once all the
partitions are uploaded. Kinda cool.
Here's some info:
Hi Peter,
If your dataset is large (3GB), then why not just chunk it into
multiple files? You'll need to do this anyway if you want to
read/write it from S3 quickly, because S3's throughput per file is
limited (as you've seen).
This is exactly why the Hadoop API lets you save datasets with
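The message is cut off here, but the approach being suggested roughly looks like the sketch below: write the data as many part files so the S3 transfers can run in parallel (assumes an existing SparkContext sc; paths, bucket name, and partition count are placeholders):

val data = sc.textFile("hdfs:///input/big.csv")
data.repartition(64)
    .saveAsTextFile("s3n://my-bucket/output/")   // produces part-00000, part-00001, ...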
Spark will only work with Scala 2.10... are you trying to do a minor
version upgrade or upgrade to a totally different version?
You could do this as follows if you want:
1. Fork the spark-ec2 repository and change the file here:
https://github.com/mesos/spark-ec2/blob/v2/scala/init.sh
2. Modify
Broadcast variables need to fit entirely in memory - so that's a
pretty good litmus test for whether or not to broadcast a smaller
dataset or turn it into an RDD.
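For illustration, a small sketch of broadcasting a lookup table that fits in memory and using it inside a map (assumes an existing SparkContext sc; all names and data are placeholders):

val lookup = Map("a" -> 1, "b" -> 2)
val bcLookup = sc.broadcast(lookup)      // shipped once per executor

val bigRdd = sc.parallelize(Seq("a", "b", "c"))
val joined = bigRdd.map(key => (key, bcLookup.value.getOrElse(key, -1)))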
On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma scrapco...@gmail.com wrote:
I'd like to be corrected on this, but I am just trying
Hi Diana,
Apart from these reasons, in a multi-stage job, Spark saves the map output
files from map stages to the filesystem, so it only needs to rerun the last
reduce stage. This is why you only saw one stage executing. These files are
saved for fault recovery but they speed up subsequent
Hi Eduardo,
Yep those machines look pretty well synchronized at this point. Just
wanted to throw that out there and eliminate it as a possible source of
confusion.
Good luck on continuing the debugging!
Andrew
On Sat, May 3, 2014 at 11:59 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it
As Mayur indicated, it's odd that you are seeing better performance from a
less-local configuration. However, the non-deterministic behavior that you
describe is likely caused by GC pauses in your JVM process.
Take note of the *spark.locality.wait* configuration parameter described
here:
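That parameter controls how long the scheduler waits for a data-local task slot before falling back to a less-local one. A sketch of setting it (the value here is only an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("LocalityTuning")
  .set("spark.locality.wait", "500")   // milliseconds in Spark 0.9/1.0; default is 3000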
Hi, I'm trying to run the kafka-word-count example in spark2.9.1. I
encountered an exception when initializing the Kafka consumer/producer config.
I'm using Scala 2.10.3 and used the Maven build of the Spark Streaming Kafka
library that comes with spark2.9.1. Has anyone seen this exception before?
Thanks,
Hi Tathagata,
I figured out the reason. I was adding the wrong Kafka lib alongside
the version Spark uses. Sorry for spamming.
Weide
On Sat, May 3, 2014 at 7:04 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I am a little confused about the version of Spark you are using. Are you
Hi Tathagata,
I actually have a separate question. What is the purpose of the lib_managed folder
inside the Spark source folder? Are those the libraries required for Spark
Streaming to run? Do they need to be added to the Spark classpath when
starting the Spark cluster?
Weide
On Sat, May 3, 2014 at 7:08 PM,
I'm using Spark for an LDA implementation. I need to cache an RDD for the next step of
Gibbs sampling, then cache the new result and unpersist the previous cache.
Something like an LRU cache should evict the previous cache because
it is never used again, but the caching gets confused.
Here is the code:
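Since the code is not shown above, here is only a sketch of the cache/unpersist pattern being described (placeholder data, not the original LDA code; assumes an existing SparkContext sc):

var current = sc.parallelize(1 to 1000).cache()
for (i <- 1 to 10) {
  val next = current.map(_ + 1).cache()
  next.count()                          // materialize the new iteration
  current.unpersist(blocking = false)   // drop the old RDD explicitly instead of
                                        // relying on LRU eviction
  current = next
}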
Hey Matei,
Not sure I understand that. These are 2 separate jobs. So the second job
takes advantage of the fact that there is map output left somewhere on disk
from the first job, and re-uses that?
On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi Diana,
Apart
Hi Michael,
Thank you very much for your reply.
Sorry, I am not very familiar with sbt. Could you tell me where to set the
Java options for the sbt fork for my program? I brought up the sbt console
and ran `set javaOptions += -Xmx1G` in it, but it returned an error:
[error]
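The error text is cut off above, but one likely cause (an assumption, not confirmed by the thread) is that the option string needs to be quoted in the sbt console, and that javaOptions only takes effect for a forked JVM:

set javaOptions += "-Xmx1G"   // the value must be a quoted String
set fork := true              // javaOptions only applies to the forked JVM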
Yes, this happens as long as you use the same RDD. For example say you do the
following:
data1 = sc.textFile(…).map(…).reduceByKey(…)
data1.count()
data1.filter(…).count()
The first count() causes outputs of the map/reduce pair in there to be written
out to shuffle files. Next time you do a