I don't think you can use rawSocketStream since the RSVP is from a web
server and you will have to send a GET request first to initialize the
communication. You are better off writing a custom receiver
https://spark.apache.org/docs/latest/streaming-custom-receivers.html for
your use case. For a
You have an issue with your cluster setup. Can you paste your
conf/spark-env.sh and the conf/slaves files here?
Your job runs fine because you set the master inside the job to local[*],
which runs in local mode (not in standalone cluster mode).
Thanks
Best Regards
On
I don't see a spark-streaming dependency at com.datastax.spark
http://mvnrepository.com/artifact/com.datastax.spark, but it does have a
kafka-streaming dependency.
Thanks
Best Regards
On Tue, May 5, 2015 at 12:42 AM, Eric Ho eric...@intel.com wrote:
Can I specify this in my build file ?
Here's a complete example
https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
Thanks
Best Regards
On Mon, May 4, 2015 at 12:57 PM, Yasemin Kaya godo...@gmail.com wrote:
Hi!
I am new to Spark and I want to begin Spark with a simple wordCount example
in Java. But I want to give
It could be filling up your /tmp directory. You need to set your
spark.local.dir, or you can also point SPARK_WORKER_DIR to another
location which has sufficient space.
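For example (untested sketch, the path is just a placeholder), you could point
spark.local.dir at a bigger disk from your SparkConf; SPARK_WORKER_DIR in
spark-env.sh works similarly on the worker side:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical path: any mount with enough free space will do.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.local.dir", "/mnt/big-disk/spark-tmp")
val sc = new SparkContext(conf)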
Thanks
Best Regards
On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote:
Hi,
I am getting No space left
Can you paste the complete stacktrace? It looks like you have a version
incompatibility with Hadoop.
Thanks
Best Regards
On Sat, May 2, 2015 at 4:36 PM, drarse drarse.a...@gmail.com wrote:
When I run my program with spark-submit everything is OK. But when I try to
run in standalone mode I
With fileStream you can actually pass a filter parameter to avoid loading
up .tmp files/directories.
Also, when you move/rename a file, the file creation date doesn't change
and hence Spark won't detect them, I believe.
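Something along these lines (a rough sketch, assuming an existing
StreamingContext ssc and a hypothetical input directory) should skip the
.tmp files:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// The filter drops anything whose name ends in .tmp before Spark picks it up.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///data/incoming",
    (path: Path) => !path.getName.endsWith(".tmp"),
    newFilesOnly = true)
  .map(_._2.toString)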
Thanks
Best Regards
On Sat, May 2, 2015 at 9:37 PM, Evo Eftimov
Looks like a version incompatibility; just make sure you have the proper
version of Spark. Also look further in the stacktrace to see what is causing
the Futures timed out (it could also be a network issue if the ports aren't
opened properly).
Thanks
Best Regards
On Sat, May 2, 2015 at 12:04 AM,
500GB of data will have nearly 3900 partitions, and if you can have nearly
that many cores and around 500GB of memory then things will be
lightning fast. :)
Thanks
Best Regards
On Sun, May 3, 2015 at 12:49 PM, sherine ahmed sherine.sha...@hotmail.com
wrote:
I need to use spark to
and block sizes are the same, shouldn't we end up with 8k
partitions?
On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:
500GB of data will have nearly 3900 partitions, and if you can have nearly
that many cores and around 500GB of memory then things will be
lightning fast
It used to exit without any problem for me. You can basically check in the
driver UI (that runs on 4040) and see exactly what it's doing.
Thanks
Best Regards
On Fri, May 1, 2015 at 6:22 PM, James Carman ja...@carmanconsulting.com
wrote:
In all the examples, it seems that the spark application
It could be.
Thanks
Best Regards
On Fri, May 1, 2015 at 9:11 PM, roy rp...@njit.edu wrote:
Hi,
I have recently enabled log4j.rootCategory=WARN, console in the Spark
configuration, but after that spark.logConf=True has become ineffective.
So I just want to confirm if this is because
There was a similar discussion over here
http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E
Thanks
Best Regards
On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote:
*Resending as I do not
In fact, sparkConf.set("spark.whateverPropertyYouWant", "Value") gets shipped
to the executors.
Thanks
Best Regards
On Fri, May 1, 2015 at 2:55 PM, Michael Ryabtsev mich...@totango.com
wrote:
Hi,
We've had a similar problem, but with a log4j properties file.
The only working way we've found was
Just make sure you are using the same version of Spark in your cluster
and in the project's build file.
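For example, in an sbt build it would be something like the following (the
version number is just an illustration; match whatever your cluster runs):

// build.sbt: keep this in sync with the Spark version deployed on the cluster
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"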
Thanks
Best Regards
On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango)
mich...@totango.com wrote:
Hi everyone,
I have a spark application that works fine on a standalone Spark
-memory 12g --executor-cores 4
12G is the limit imposed by the YARN cluster; I can't go beyond this.
ANY suggestions ?
Regards,
Deepak
On Thu, Apr 30, 2015 at 6:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
Did not work. Same problem.
On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das ak
This is the Spark mailing list :/
Yes, you can configure the following in mapred-site.xml for that:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
Thanks
Best Regards
On Tue, Apr 28, 2015 at 11:00 PM, Shushant Arora shushantaror...@gmail.com
wrote:
In
If the data is too huge and is in S3, that'll be a lot of network traffic;
instead, if the data is available in HDFS (with proper replication) then it
will be faster, as most of the time the data will be available as
PROCESS_LOCAL/NODE_LOCAL to the executor.
Thanks
Best Regards
On Wed, Apr
You could try increasing your heap space explicitly, like export
_JAVA_OPTIONS=-Xmx10g. It's not the cleanest approach, but give it a try.
Thanks
Best Regards
On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have a SparkApp that runs completes in 45 mins for 5 files (5*750MB
Does this speed up?
val rdd = sc.parallelize(1 to 100, 30)
rdd.count
Thanks
Best Regards
On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle ans...@betaglide.com
wrote:
Hi,
I'm running the following code in my cluster (standalone mode) via spark
shell -
val rdd = sc.parallelize(1 to
Have a look at KafkaRDD
https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kafka/KafkaRDD.html
Thanks
Best Regards
On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
I'm wondering about the use-case where you're not doing continuous,
This is how I used to do it:
- Login to the ec2 cluster (master)
- Make changes to Spark, and build it.
- Stop the old installation of Spark (sbin/stop-all.sh)
- Copy the old installation's conf/* to the modified version's conf/
- Rsync the modified version to all slaves
- Do sbin/start-all.sh from the
You can replace your cluster's assembly jar (on the master and workers) with
your custom-built assembly jar.
Thanks
Best Regards
On Tue, Apr 28, 2015 at 9:45 PM, Bo Fu b...@uchicago.edu wrote:
Hi all,
I have an issue. I added some timestamps in Spark source code and built it
using:
mvn package
One approach you could try: inside the map, you can have a synchronized
thread and block the map until the thread finishes processing.
Thanks
Best Regards
On Wed, Apr 29, 2015 at 9:38 AM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
Hi
In a multi-node setup, I am
It is possible to access the filename; it's a bit tricky though.
val fstream = ssc.fileStream[LongWritable, IntWritable,
  SequenceFileInputFormat[LongWritable, IntWritable]]("/home/akhld/input/")
fstream.foreach(x => {
  // You can get it with this object.
How about:
JavaPairDStream<LongWritable, Text> input =
  jssc.fileStream(inputDirectory, LongWritable.class, Text.class,
    TextInputFormat.class);
See the complete example over here
Option B would be fine; as the SO answer itself says, since RDD
transformations merely build DAG descriptions without execution, in Option
A by the time you call unpersist you still only have job descriptions and
not a running execution.
Also note, in Option A you are not specifying any
You need to look deeper into your worker logs; if you look closely you may
find GC errors, IO exceptions etc. that are triggering the timeout.
Thanks
Best Regards
On Mon, Apr 27, 2015 at 3:18 AM, Deepak Gopalakrishnan dgk...@gmail.com
wrote:
Hello Patrick,
Sure. I've posted this on user as
Isn't it already available on the driver UI (that runs on 4040)?
Thanks
Best Regards
On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
I am wondering how we should understand the running time of SparkSQL
queries? For example the physical query plan and the running
Like this?
messages.foreachRDD(rdd => {
  if (rdd.count() > 0) {
    // Do whatever you want.
  }
})
Thanks
Best Regards
On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
Hi,
I need to compare the count of messages received, whether it is 0 or not, but
messages.count() returns a
Maybe this will give you a good start
https://github.com/apache/spark/pull/2077
Thanks
Best Regards
On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco gibb...@gmail.com
wrote:
Hi,
I would like to know if it is possible to build the DAG before actually
executing the application. My
Make sure you have at least 2 cores for your streaming application.
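For a quick local test (just a sketch; the app name and batch interval are
arbitrary), local[2] gives one core to the receiver and one to the processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One core is taken by the receiver, so at least one more is needed to process data.
val conf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))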
Thanks
Best Regards
On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote:
I hit the same issue as if the directory has no files at all when running
the sample examples/src/main/python/streaming/hdfs_wordcount.py
, 2015 at 1:27 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you try writing to a different S3 bucket and confirm that?
Thanks
Best Regards
On Thu, Apr 23, 2015 at 12:11 AM, Daniel Mahler dmah...@gmail.com
wrote:
Hi Akhil,
It works fine when outprefix is a hdfs:///localhost/... url
The directory in ZooKeeper to store recovery state (default: /spark).
-Jeff
From: Sean Owen so...@cloudera.com
To: Akhil Das ak...@sigmoidanalytics.com
Cc: Michal Klos michal.klo...@gmail.com, User user@spark.apache.org
Date: Wed, 22 Apr 2015 11:05:46 +0100
Subject: Re: Multiple HA spark
are in that dir. For me the most confusing thing is
that the executor can actually create HiveConf objects, but then it cannot
find them when the task deserializer is at work.
On 20 April 2015 at 14:18, Akhil Das ak...@sigmoidanalytics.com wrote:
Can you try sc.addJar("/path/to/your/hive/jar"), i
You can enable this flag to run multiple jobs concurrently. It might not be
production ready, but you can give it a try:
sparkConf.set("spark.streaming.concurrentJobs", "2")
Refer to TD's answer here
You can simply use a custom input format (AccumuloInputFormat) with the
Hadoop RDDs (sc.newAPIHadoopFile etc.) for that; all you need to do is
pass the jobConfs. Here's a pretty clean discussion:
With Maven you could do something like:
mvn -Dhadoop.version=2.3.0 -DskipTests clean package -pl core
Thanks
Best Regards
On Mon, Apr 20, 2015 at 8:10 PM, Shiyao Ma i...@introo.me wrote:
Hi.
My usage is only about the spark core and hdfs, so no spark sql or
mlib or other components invovled.
I saw
It could be a similar issue as
https://issues.apache.org/jira/browse/SPARK-4300
Thanks
Best Regards
On Tue, Apr 21, 2015 at 8:09 AM, donhoff_h 165612...@qq.com wrote:
Hi,
I am studying the RDD caching function and wrote a small program to verify
it. I run the program in a Spark 1.3.0
I think DStream.transform is the one that you are looking for.
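Roughly like this (untested sketch; dstream, the key function and the
partition count are placeholders), transform lets you repartition every
batch RDD:

import org.apache.spark.HashPartitioner

// Key the records, then hash-partition each batch's RDD.
val partitioned = dstream.transform { rdd =>
  rdd.map(record => (record.hashCode, record))
     .partitionBy(new HashPartitioner(8))
}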
Thanks
Best Regards
On Mon, Apr 20, 2015 at 9:42 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Is the only way to implement custom partitioning of a DStream via the
foreach approach, so as to gain access to the actual RDDs comprising
Your spark master should be spark://swetha:7077 :)
Thanks
Best Regards
On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote:
PFA screenshot of my cluster UI
Thanks
On Monday 20 April 2015 02:27 PM, Akhil Das wrote:
Are you seeing your task being submitted to the UI
2015 12:28 PM, Akhil Das wrote:
In Eclipse, while you create your SparkContext, set the master URI
as shown in the web UI's top left corner, like spark://someIPorHost:7077,
and it should be fine.
Thanks
Best Regards
On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com
Try doing a sc.addJar("path\\to\\your\\postgres\\jar")
Thanks
Best Regards
On Mon, Apr 20, 2015 at 12:26 PM, shashanksoni shashankso...@gmail.com
wrote:
I am using a Spark 1.3 standalone cluster on my local Windows machine and
trying to load data from one of our servers. Below is my code -
import os
was suspecting some foul play with
classloaders.
On 20 April 2015 at 12:20, Akhil Das ak...@sigmoidanalytics.com wrote:
Looks like a missing jar, try to print the classpath and make sure the
hive jar is present.
Thanks
Best Regards
On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma manku.tim
Why not build the project and submit the built jar with spark-submit?
If you want to run it within Eclipse, then all you have to do is create a
SparkContext pointing to your cluster, do a
sc.addJar("/path/to/your/project/jar"), and then you can hit the run button
to run the job (note that network
Looks like a missing jar, try to print the classpath and make sure the hive
jar is present.
Thanks
Best Regards
On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma manku.tim...@gmail.com
wrote:
I am using spark-1.3 with hadoop-provided and hive-provided and
hive-0.13.1 profiles. I am running a
It would be good if you could paste your custom receiver code and the code
that you used to invoke it.
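For reference, a bare-bones receiver usually looks something like this (just
a hedged sketch; the URL and fetch logic are placeholders): onStart must
return quickly and do the actual work in its own thread.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SimpleReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Spawn a thread; onStart itself must not block.
    new Thread("Simple Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record fetched from " + url) // replace with real fetch logic
          Thread.sleep(1000)
        }
      }
    }.start()
  }
  def onStop(): Unit = {} // the loop above checks isStopped(), nothing else to clean up
}

// Invocation, assuming an existing StreamingContext ssc:
// val stream = ssc.receiverStream(new SimpleReceiver("http://example.com/feed"))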
Thanks
Best Regards
On Mon, Apr 20, 2015 at 9:43 AM, Ankit Patel patel7...@hotmail.com wrote:
I am experiencing a problem with Spark Streaming (Spark 1.2.0): the onStart
method is never called on
Which version of Spark are you using? Did you try
using spark.shuffle.blockTransferService=nio?
Thanks
Best Regards
On Sat, Apr 18, 2015 at 11:14 PM, roy rp...@njit.edu wrote:
Hi,
My spark job is failing with following error message
org.apache.spark.shuffle.FetchFailedException:
In Eclipse, while you create your SparkContext, set the master URI as
shown in the web UI's top left corner, like spark://someIPorHost:7077, and
it should be fine.
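A minimal sketch (the master URI and jar path are placeholders you'd copy
from your own setup):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyJobFromEclipse")
  .setMaster("spark://someIPorHost:7077") // copy this from the web UI's top-left corner
val sc = new SparkContext(conf)
// Ship your project's jar so the executors can find your classes.
sc.addJar("/path/to/your/project/jar")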
Thanks
Best Regards
On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com wrote:
Hi All,
I am new to spark and
Not sure if this will help, but try clearing your jar cache directories
(~/.ivy2 for sbt and ~/.m2 for Maven).
Thanks
Best Regards
On Wed, Apr 15, 2015 at 9:33 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Env - Spark 1.3, Hadoop 2.3, Kerberos
xx.saveAsTextFile(path, codec) gives the following
, Akhil Das ak...@sigmoidanalytics.com wrote:
Can you paste your complete code? Did you try repartitioning/increasing the
level of parallelism to speed up the processing? Since you have 16 cores,
I'm assuming your 400k records aren't bigger than a 10G dataset.
Thanks
Best Regards
On Thu, Apr 16
There's a version incompatibility between your Hadoop jars. You need to
make sure you build your Spark against Hadoop version 2.5.0-cdh5.3.1.
Thanks
Best Regards
On Fri, Apr 17, 2015 at 5:17 AM, lalasriza . lala.s.r...@gmail.com wrote:
Dear everyone,
right now I am working with SparkR on
Hi
With Spark Streaming on version 1.3.0, when I'm using WAL and checkpoints,
I'm sometimes hitting FileNotFound exceptions.
Here's the complete stacktrace:
https://gist.github.com/akhld/126b945f7fef408a525e
The application simply reads data from Kafka and does a simple wordcount
over it. Batch
I used to hit this issue when my processing time exceeded the batch
duration. Here are a few workarounds:
- Use storage level MEMORY_AND_DISK
- Enable WAL and checkpointing
The above two will slow things down a little bit.
If you want low latency, what you can try is:
- Use storage level as
Try increasing your driver memory.
Thanks
Best Regards
On Thu, Apr 16, 2015 at 6:09 PM, sarath sarathkrishn...@gmail.com wrote:
Hi,
I'm trying to train an SVM on the KDD2010 dataset (available from libsvm).
But I'm getting a java.lang.OutOfMemoryError: Java heap space error. The
dataset is
...@gmail.com
wrote:
I already checked, and GC is taking 1 sec for each task. Is this too much?
If yes, how do I avoid this?
On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote:
Open the driver UI and see which stage is taking time; you can look at
whether it's adding any GC time etc.
You could try repartitioning your RDD using a custom partitioner
(HashPartitioner etc.) and caching the dataset in memory to speed up the
joins.
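Something like this (a rough sketch; leftRdd, rightRdd and the partition
count are placeholders):

import org.apache.spark.HashPartitioner

// Co-partition both pair RDDs and cache them so repeated joins avoid a full shuffle.
val partitioner = new HashPartitioner(64)
val left  = leftRdd.partitionBy(partitioner).cache()
val right = rightRdd.partitionBy(partitioner).cache()
val joined = left.join(right)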
Thanks
Best Regards
On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I have an RDD that contains
Open the driver UI and see which stage is taking time; you can look at whether
it's adding any GC time etc.
Thanks
Best Regards
On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi All, I have the below code where distinct is running for a long time.
blockingRdd is the
You can simply override the isSplitable method in your custom input format
class and make it return false.
Here's a sample code snippet:
http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop#answers-header
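The idea is roughly this (untested sketch, the class name is made up): extend
an existing input format and return false from isSplitable so each file
becomes a single split.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical input format: never split files, so one file = one split.
class NonSplittableTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}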
Thanks
Best Regards
On Thu, Apr 16, 2015 at 4:18 PM,
You can plug the native Hadoop input formats into Spark's
sc.newAPIHadoopFile etc., which takes in the input format.
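For example (sketch only; the path and format are placeholders, and the same
pattern works for AccumuloInputFormat etc. as long as you pass the right
key/value classes and job configuration):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",          // hypothetical path
  classOf[TextInputFormat],      // any Hadoop (new API) InputFormat can go here
  classOf[LongWritable],
  classOf[Text])
rdd.map(_._2.toString).take(5)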
Thanks
Best Regards
On Thu, Apr 16, 2015 at 10:15 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Is it for spark?
On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das ak
Is it working without kryo?
Thanks
Best Regards
On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi All, I am getting the below exception while using Kryo serialization with
a broadcast variable. I am broadcasting a hashmap with the below line.
Map<Long, MatcherReleventData>
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile
Thanks
Best Regards
On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
Is it possible to store RDDs as custom output formats, For example ORC?
Thanks,
Daniel
Just make sure you have at least 2 cores available for processing. You can
try launching it in local[2] and make sure it's working fine.
Thanks
Best Regards
On Tue, Apr 14, 2015 at 11:41 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Hi
I am running a spark streaming application but on
Make sure your YARN service is running on 8032.
Thanks
Best Regards
On Tue, Apr 14, 2015 at 12:35 PM, Vineet Mishra clearmido...@gmail.com
wrote:
Hi Team,
I am running Spark Word Count example(
https://github.com/sryza/simplesparkapp), if I go with master as local it
works fine.
But when
Did you try reducing your spark.executor.memory?
Thanks
Best Regards
On Wed, Apr 15, 2015 at 2:29 PM, Brahma Reddy Battula
brahmareddy.batt...@huawei.com wrote:
Hello Sparkers
I am a newbie to Spark and need help. We are using Spark 1.2; we are
getting the following error and the executor is
Once you start your streaming application to read from Kafka, it will
launch receivers on the executor nodes, and you can see them on the
streaming tab of your driver UI (runs on 4040).
[image: Inline image 1]
These receivers will stay fixed until the end of your pipeline (unless they
crash etc.)
, Akhil,
I would ask a question here: assume Receiver-0 crashes, will it be
restarted on another worker node (in your picture, there would then be 2
receivers on the same node) or will it start on the same node?
--
bit1...@163.com
*From:* Akhil Das ak
You can use Tachyon-based storage for that, and every time the client
queries, you just get it from there.
Thanks
Best Regards
On Mon, Apr 6, 2015 at 6:01 PM, Siddharth Ubale siddharth.ub...@syncoms.com
wrote:
Hi ,
In Spark Web Application the RDD is generating every time client is
You could try leaving all the configuration values at their defaults and
running your application to see if you are still hitting the heap issue. If
so, try adding swap space to the machines, which will definitely help.
Another way would be to set the heap space manually (export
_JAVA_OPTIONS=-Xmx5g).
When you say "done fetching documents", does it mean that you are stopping
the StreamingContext (ssc.stop), or did you mean completed fetching documents
for a batch? If possible, you could paste your custom receiver code so that
we can have a look at it.
Thanks
Best Regards
On Tue, Apr 7, 2015 at
Why are you not using sbin/start-all.sh?
Thanks
Best Regards
On Wed, Apr 8, 2015 at 10:24 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying to start the worker by:
sbin/start-slave.sh spark://ip-10-241-251-232:7077
In the logs it's complaining about:
Master must be a URL of the
One hack you could put in would be to bring the Result class
http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hbase/hbase/0.89.20100924-28/org/apache/hadoop/hbase/client/Result.java/?v=source
in locally, make it serializable (implement Serializable), and use it.
Just make sure you import the following:
import org.apache.spark.SparkContext._
import org.apache.spark.StreamingContext._
Thanks
Best Regards
On Wed, Apr 8, 2015 at 6:38 AM, Su She suhsheka...@gmail.com wrote:
Hello Everyone,
I am trying to implement this example (Spark Streaming with
If you want to use 2g of memory on each worker, you can simply export
SPARK_WORKER_MEMORY=2g inside your spark-env.sh on all machines in the
cluster.
Thanks
Best Regards
On Wed, Apr 8, 2015 at 7:27 AM, Jia Yu jia...@asu.edu wrote:
Hi guys,
Currently I am running Spark program on Amazon EC2.
Can you share a bit more information on the type of application that you
are running? From the stacktrace I can only say that, for some reason, your
connection timed out (probably a GC pause or network issue).
Thanks
Best Regards
On Wed, Apr 8, 2015 at 9:48 PM, Shuai Zheng szheng.c...@gmail.com wrote:
That totally depends on your disk IO and the number of CPUs that you have
in the cluster. For example, if you have a disk IO of 100MB/s per machine and
a handful of CPUs (say 40 cores, on 10 machines), then it could take you to
~1GB/sec (10 machines x 100MB/s), I believe.
Thanks
Best Regards
On Tue, Apr 7, 2015 at 2:48 AM,
We have a similar version (Sigstream), you could find more over here
https://sigmoid.com/
Thanks
Best Regards
On Wed, Apr 8, 2015 at 9:25 AM, haopu hw...@qilinsoft.com wrote:
I'm also interested in this project. Do you have any update on it? Is it
still active?
Thank you!
Where exactly is it throwing the null pointer exception? Are you starting
your program from another program or something? It looks like you are
invoking ProcessBuilder etc.
Thanks
Best Regards
On Thu, Apr 9, 2015 at 6:46 PM, Somnath Pandeya somnath_pand...@infosys.com
wrote:
JavaRDD<String>
You can try something like this:
eventsDStream.foreachRDD(rdd => {
  val curdate = new DateTime()
  val fmt = DateTimeFormat.forPattern("dd_MM_")
  rdd.saveAsTextFile("s3n://bucket_name/test/events_" + fmt.print(curdate) + "/events")
})
Thanks
Best Regards
On Fri, Apr 10, 2015 at 4:22
There's sc.objectFile also.
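A quick round-trip looks roughly like this (sketch only, the path is a
placeholder); it uses plain Java serialization under the hood:

val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("hdfs:///tmp/nums-obj")         // write
val back = sc.objectFile[Int]("hdfs:///tmp/nums-obj") // read it back
back.count()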
Thanks
Best Regards
On Tue, Apr 14, 2015 at 2:59 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Can you please share the native support for data formats available with
Spark?
Two I can see are parquet and textFile:
sc.parquetFile
sc.textFile
I see that Hadoop
Are you expecting to receive the values 1 to 100 in your second program?
An RDD is just an abstraction; you would need to do something like:
num.foreach(x => send(x))
Thanks
Best Regards
On Mon, Apr 6, 2015 at 1:56 AM, raggy raghav0110...@gmail.com wrote:
For a class project, I am trying to utilize 2 spark
We had a few sessions at Sigmoid; you could go through the meetup page for
details:
http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/
On 6 Apr 2015 18:01, Abhideep Chakravarty
abhideep.chakrava...@mindtree.com wrote:
Hi all,
We are here planning to setup a Spark learning
How are you submitting the application? Use a standard build tool like
Maven or sbt to build your project; it will download all the dependency
jars. When you submit your application (if you are using spark-submit,
then use the --jars option to add those jars which are causing the
ClassNotFoundException).
We have a custom version/build of Spark Streaming that does the nested S3
lookups faster (it uses native S3 APIs). You can find the source code over
here: https://github.com/sigmoidanalytics/spark-modified, in particular the
changes from here
I think these are the configurations that you are looking for:
*spark.locality.wait*: Number of milliseconds to wait to launch a
data-local task before giving up and launching it on a less-local node. The
same wait will be used to step through multiple locality levels
(process-local,
= SchemaRDD[5] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.ClassNotFoundException: json_tuple
Any other suggestions or am I doing something else wrong here?
-Todd
On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try adding all
Did you try these?
- Disable shuffle spill: spark.shuffle.spill=false
- Enable log rotation:
sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
Thanks
Best Regards
On Fri, Apr 3, 2015
:34 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you try building Spark
https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support
with hive support? Before that try to run the following:
./bin/spark-shell --master
There isn't any specific Linux distro, but I would prefer Ubuntu for a
beginner as it's very easy to apt-get install stuff on it.
Thanks
Best Regards
On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias tobias.horsm...@uni-due.de
wrote:
Hi,
Are there any recommendations for operating systems
This thread might give you some insights
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
Thanks
Best Regards
On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
My Spark Job
iPhone
On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:
I was able to write a record that extends SpecificRecord (Avro); this class
was not auto-generated. Do we need to do something extra for auto-generated
classes?
Sent from my iPhone
On 03-Apr-2015, at 5:06 pm, Akhil Das ak
Try adding all the jars in your $HIVE/lib directory. If you want the
specific jar, you could look for jackson or json serde in it.
Thanks
Best Regards
On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote:
I have a feeling I’m missing a Jar that provides the support or could this
It will add scheduling delay for the new batch. The new batch's data will be
processed after the previous batch finishes; when the delay is too high,
sometimes it will throw fetch failures as the batch data could get removed
from memory.
Thanks
Best Regards
On Wed, Apr 1, 2015 at 11:35 AM,
Once you submit the job, do a ps aux | grep spark-submit and see how much
heap space is allocated to the process (the -Xmx params); if you are
seeing a lower value you could try increasing it yourself with:
export _JAVA_OPTIONS=-Xmx5g
Thanks
Best Regards
On Wed, Apr 1, 2015 at 1:57 AM, Shuai
Error 23 is defined as a partial transfer and might be caused by
filesystem incompatibilities, such as different character sets or access
control lists. In this case it could be caused by the double slashes (// at
the end of sbin). You could try editing your sbin/spark-daemon.sh file;
look for
You can add an internal-IP-to-public-hostname mapping in your /etc/hosts
file; if your forwarding is proper then it wouldn't be a problem
thereafter.
Thanks
Best Regards
On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com wrote:
Hi,
For security reasons, we added a server between
It's pretty simple: pick one machine as the master (say machine A), and let's
call the workers B, C, and D.
*Login to A:*
- Enable password-less authentication (ssh-keygen)
- Add A's ~/.ssh/id_rsa.pub to B, C, and D's ~/.ssh/authorized_keys file
- Download the Spark binary (that supports your hadoop
the spark-env.sh file?
Thanks!
Anny
On Mon, Mar 30, 2015 at 11:15 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can add an internal-IP-to-public-hostname mapping in your /etc/hosts
file; if your forwarding is proper then it wouldn't be a problem
thereafter.
Thanks
Best Regards
On Tue
What happens when you do:
sc.textFile("hdfs://path/to/the_file.txt")
Thanks
Best Regards
On Mon, Mar 30, 2015 at 11:04 AM, Nick Travers n.e.trav...@gmail.com
wrote:
Hi List,
I'm following this example here
https://github.com/databricks/learning-spark/tree/master/mini-complete-example
Do you have enough messages in Kafka to consume? Can you make sure your
Kafka setup is working with the console consumer? Also try this example
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
Thanks