On Tue, Jun 9, 2015 at 1:07 PM, amit tewari amittewar...@gmail.com
wrote:
Thanks Akhil, as you suggested, I have to go with keyBy(route) as I need the
columns intact.
But will keyBy() accept multiple fields (e.g. x(0), x(1))?
Thanks
Amit
On Tue, Jun 9, 2015 at 12:26 PM, Akhil Das ak
downloaded the application jar but not
the other jars specified by "--jars".
Or do I misunderstand the usage of "--jars", and the jars should already be
on every worker, so the driver will not download them?
Are there any useful docs?
Thanks
Dong Lei
*From:* Akhil Das [mailto:ak
Regards
On Tue, Jun 9, 2015 at 2:09 PM, luohui20...@sina.com wrote:
Only 1 minor GC, 0.07s.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: How
Try this way:
scala> val input1 = sc.textFile("/test7").map(line => line.split(",").map(_.trim))
scala> val input2 = sc.textFile("/test8").map(line => line.split(",").map(_.trim))
scala> val input11 = input1.map(x => ((x(0) + x(1)), x(2), x(3)))
scala> val input22 = input2.map(x => ((x(0) + x(1)), x(2), x(3)))
scala>
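For the keyBy question above: keyBy takes a function, so you can return a tuple of several columns as the key and still keep the whole row as the value. A rough sketch along those lines (the field indices are only illustrative):
scala> val keyed1 = input1.keyBy(x => (x(0), x(1)))   // key = (col0, col1), value = the full row
scala> val keyed2 = input2.keyBy(x => (x(0), x(1)))
scala> val joined = keyed1.join(keyed2)               // RDD[((col0, col1), (row1, row2))]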
like this?
myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec))
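Here `codec` would be a Hadoop compression codec class; for example, assuming Gzip is what you want:
import org.apache.hadoop.io.compress.GzipCodec
myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", classOf[GzipCodec]))
In practice you would probably include the batch time in the path (foreachRDD also has a (rdd, time) variant) so that each batch writes to its own directory.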
Thanks
Best Regards
On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro rcors...@gmail.com wrote:
It looks like saveAsTextFiles doesn't support the compression parameter of
RDD.saveAsTextFile. Is there a way to add the
Maybe you should check your driver UI and see if there's any GC time
involved, etc.
Thanks
Best Regards
On Mon, Jun 8, 2015 at 5:45 PM, luohui20...@sina.com wrote:
Hi there,
I am trying to decrease my app's running time on the worker node. I
checked the log and found the most
Once you submit the application, you can check the driver UI (running
on port 4040), Environment tab, to see whether those jars you added got
shipped or not. If they are shipped and you are still getting NoClassDef
exceptions, then it means you have a jar conflict, which you can
resolve
Can you look in your worker logs for a more detailed stack trace? If it's
about winutils.exe you can look at these links to get it resolved.
- http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7
- https://issues.apache.org/jira/browse/SPARK-2356
Thanks
Best Regards
On Mon, Jun 8,
it just lets me straight in.
On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure
it's able to log in without a password?
Thanks
Best Regards
On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure it's
able to log in without a password?
Thanks
Best Regards
On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin...@gmail.com wrote:
I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker)
These two hosts have
It could be a CPU, IO, or network bottleneck; you need to figure out where
exactly it's choking. You can use certain monitoring utilities (like top)
to understand it better.
Thanks
Best Regards
On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi All,
I have a Spark
Another approach would be to use ZooKeeper. If you have ZooKeeper
running somewhere in the cluster you can simply create a path like
*/dynamic-list* in it and then write objects/values to it; you can even
create/access nested objects.
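A rough sketch of that idea with the plain ZooKeeper client (the quorum address and paths are just examples):
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}
val zk = new ZooKeeper("zkhost:2181", 5000, null)                  // hypothetical quorum address
// the parent path has to exist (or be created) before children are written
zk.create("/dynamic-list/item-1", "some-value".getBytes("UTF-8"),
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
val bytes = zk.getData("/dynamic-list/item-1", false, null)        // read it back
println(new String(bytes, "UTF-8"))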
Thanks
Best Regards
On Fri, Jun 5, 2015 at 7:06 PM,
Which consumer are you using? If you can paste the complete code then maybe
I can try reproducing it.
Thanks
Best Regards
On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote:
And here is the Thread Dump, where it seems every worker is waiting for
Executor
#6 Thread 95:
You could try adding the configuration in the spark-defaults.conf file. And
once you run the application you can check the driver UI (runs
on port 4040), Environment tab, to see if the configuration is set properly.
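For reference, spark-defaults.conf entries are just whitespace-separated key/value pairs, one per line; the values below are only illustrative:
spark.executor.memory      4g
spark.serializer           org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled     true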
Thanks
Best Regards
On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel
you can simply do rdd.repartition(1).saveAsTextFile(...), it might not be
efficient if your output data is huge since one task will be doing the
whole writing.
Thanks
Best Regards
On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote:
Hi all
I'm running spark in a single local
Hi
Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183
[image: Inline image 1]
Thanks
Best Regards
On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
I've got a Spark Streaming driver job implemented and in it, I register a
streaming
Replace this line:
img_data = sc.parallelize( list(im.getdata()) )
With:
img_data = sc.parallelize( list(im.getdata()), 3 * <number of cores you have> )
Thanks
Best Regards
On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur jmspar...@gmail.com wrote:
Hi all,
I'm playing around with
That's because you need to add the master's public key (~/.ssh/id_rsa.pub)
to the newly added slave's ~/.ssh/authorized_keys.
I add slaves this way:
- Launch a new instance by clicking on the slave instance and choosing *launch
more like this*
- Once it's launched, ssh into it and add the master
is, Is there an
alternate API through which a Spark application can be launched and which can
return an exit status back to the caller, as opposed to initiating a JVM halt.
On Wed, Jun 3, 2015 at 12:58 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Run it as a standalone application. Create an sbt project
You need to look into your executor/worker logs to see what's going on.
Thanks
Best Regards
On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no
wrote:
Hi,
What can be the cause of this ERROR cluster.YarnScheduler: Lost executor?
How can I fix it?
Best,
Patcharee
I think when you do an ssc.stop it will stop your entire application, and by
updating a transformation function do you mean modifying the driver program?
In that case, even if you checkpoint your application, it won't be able to
recover from its previous state.
A simpler approach would be to add certain
Run it as a standalone application. Create an sbt project and do sbt run?
Thanks
Best Regards
On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri
pavan.kolam...@gmail.com wrote:
Hi guys, I am new to Spark. I am using spark-submit to submit Spark jobs.
But for my use case I don't want it
)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
Best,
Patcharee
On 3 June 2015 at 09:21, Akhil Das wrote:
You need to look into your executor/worker logs to see what's going on.
Thanks
Best Regards
On Wed, Jun 3, 2015 at 12:01 PM, patcharee
If you want to submit applications to a remote cluster where your port 7077
is opened publicly, then you would need to set *spark.driver.host* (to
the public IP of your laptop) and *spark.driver.port* (optional, if there's
no firewall between your laptop and the remote cluster). Keeping
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...). I think
StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in
memory, and if there isn't sufficient space then it will be spilled to
disk.
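A small sketch of that suggestion (expand(...) is a stand-in for your own flatMap logic):
import org.apache.spark.storage.StorageLevel
val cached   = rdd.persist(StorageLevel.MEMORY_AND_DISK)   // or StorageLevel.DISK_ONLY
val expanded = cached.flatMap(record => expand(record))    // partitions that don't fit in memory are spilled to disk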
Thanks
Best Regards
On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea
It says your namenode is down (connection refused on 8020); you can restart
your HDFS by going into the Hadoop directory and typing sbin/stop-dfs.sh and
then sbin/start-dfs.sh
Thanks
Best Regards
On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote:
Hello All,
A bit scared I did
I found an interesting presentation
http://www.slideshare.net/colorant/spark-shuffle-introduction; also go
through this thread
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
Thanks
Best Regards
On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
You can try to skip the tests, try with:
mvn -Dhadoop.version=2.4.0 -Pyarn *-DskipTests* clean package
Thanks
Best Regards
On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch java...@gmail.com wrote:
I downloaded the 1.3.1 distro tarball
$ll ../spark-1.3.1.tar.gz
-rw-r-@ 1 steve staff
You can run/submit your code from one of the workers which has access to the
file system and it should be fine, I think. Give it a try.
Thanks
Best Regards
On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar pradyumna.ac...@gmail.com
wrote:
Hello!
I have Spark running in standalone mode, and there
Basically, you need to convert it to a serializable format before doing the
collect/take.
You can fire up a spark shell and paste this:
val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
  .map(_._2.toString)
sFile.take(5).foreach(println)
Use the
This thread
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
has various methods on accessing S3 from spark, it might help you.
Thanks
Best Regards
On Sun, May 24, 2015 at 8:03 AM, ogoh oke...@gmail.com wrote:
Hello,
I am
Maybe you can make use of the Window operations
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#window-operations,
Another approach would be to keep your incoming data in an
HBase/Redis/Cassandra kind of database and then whenever you need to
average it, you just query the
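A rough sketch of the window-based averaging idea (the pairStream name, value types and durations are assumptions):
import org.apache.spark.streaming.{Minutes, Seconds}
// keep (sum, count) per key over the last 10 minutes, recomputed every 30 seconds
val sums = pairStream.mapValues(v => (v, 1L))
  .reduceByKeyAndWindow((a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
    Minutes(10), Seconds(30))
val averages = sums.mapValues { case (sum, count) => sum / count }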
Here's more detailed documentation
https://github.com/datastax/spark-cassandra-connector from DataStax. You
can also shoot an email directly to their mailing list
http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
since it's more related to their code.
Thanks
Best
I do it this way:
- Launch a new instance by clicking on the slave instance and choosing *launch
more like this*
- Once it's launched, ssh into it and add the master public key to
.ssh/authorized_keys
- Add the slave's internal IP to the master's conf/slaves file
- Do sbin/start-all.sh and it will
MEMORY_AND_DISK_SER_2. Besides, the driver's memory is also growing. I
don't think Kafka messages will be cached in driver.
On Thu, May 28, 2015 at 12:24 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Are you using the createStream or createDirectStream API? If it's the
former, you can try setting
aggressive? Let's say I have 20
slaves up, and I want to add one more, why should we stop the entire
cluster for this?
thanks, nizan
On Thu, May 28, 2015 at 10:19 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I do this way:
- Launch a new instance by clicking on the slave instance
You can use the python boto library for that; in fact the spark-ec2 script uses it
underneath. Here's the
https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L706 call
spark-ec2 makes to get all machines under a given security group.
Thanks
Best Regards
On Thu, May 28, 2015 at 2:22 PM,
=>
rdd.foreachPartition { iter =>
  iter.foreach(count => logger.info(count.toString))
}
}
It receives messages from Kafka, parses the JSON, filters and counts the
records, and then prints them to the logs.
Thanks.
On Thu, May 28, 2015 at 3:07 PM, Akhil Das ak
After submitting the job, if you do a ps aux | grep spark-submit then you
can see all the JVM params. Are you using the high-level consumer (receiver
based) for receiving data from Kafka? In that case if your throughput is
high and the processing delay exceeds the batch interval then you will hit this
How about creating two and unioning them [ sc.union(first, second) ]?
Thanks
Best Regards
On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have this piece
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](
files like result1.txt, result2.txt ... result21.txt.
Sorry for not adding comments to my code.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: Re
Try these:
- Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM)
- Enable log rotation:
sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
You can also look into
Best regards!
San.Luo
----- Original Message -----
From: madhu phatak phatak@gmail.com
To: luohui20...@sina.com
Cc: Akhil Das ak...@sigmoidanalytics.com, user user@spark.apache.org
Subject: Re: Re: how to distributed run a bash shell in spark
Date: 2015-05-25 14:11
Hi,
You can use the pipe operator, if you
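A minimal sketch of RDD.pipe (the script path is hypothetical):
val data  = sc.textFile("/input/data.txt")
// each partition's lines go to the script's stdin; its stdout becomes the new RDD
val piped = data.pipe("/path/to/your-script.sh")
piped.collect().foreach(println)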
Try this way:
object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}
val someRdd = spark.parallelize(List(1, 2, 3))
someRdd.map { element =>
  Holder.log.info(s"$element will be processed")
  element + 1
If you want to be notified after every batch completes, then you can simply
implement the StreamingListener
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
interface, which has methods like onBatchCompleted, onBatchStarted, etc., in
which
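A small sketch of such a listener (the class name is made up; ssc is assumed to be your StreamingContext):
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
class BatchNotifier extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime} completed, processing delay: ${info.processingDelay}")
  }
}
ssc.addStreamingListener(new BatchNotifier)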
Hi Kevin,
Did you try adding a host name for the ipv6? I have a few ipv6 boxes; Spark
failed for me when I used just the ipv6 addresses, but it works fine when I
use the host names.
Here's an entry in my /etc/hosts:
2607:5300:0100:0200::::0a4d hacked.work
My spark-env.sh file:
You mean you want to execute some shell commands from Spark? Here's
something I tried a while back. https://github.com/akhld/spark-exploit
Thanks
Best Regards
On Sun, May 24, 2015 at 4:53 PM, luohui20...@sina.com wrote:
Hello there,
I am trying to run an app in which part of it needs to
I used to hit an NPE when I didn't add all the dependency jars to my context
while running it in standalone mode. Can you try adding all these
dependencies to your context?
sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.1.jar")
You can look at the logic for offloading data from memory by looking at the
ensureFreeSpace
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L416
call.
And dropFromMemory
without problem.
Best Regards,
Allan
On 21 May 2015 at 01:30, Akhil Das ak...@sigmoidanalytics.com wrote:
This is more like an issue with your HDFS setup, can you check in the
datanode logs? Also try putting a new file in HDFS and see if that works.
Thanks
Best Regards
On Wed, May 20
How many part files do you have? Did you try repartitioning to a
smaller number so that you end up with fewer, bigger files?
Thanks
Best Regards
On Wed, May 20, 2015 at 3:06 AM, Richard Grossman richie...@gmail.com
wrote:
Hi
I'm using spark 1.3.1 and now I can't set the size of
textFile does read all files in a directory.
We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
This thread happened a year back; can you please share what issue you are
facing? Which version of Spark are you using? What is your system
environment? Exception stack trace?
Thanks
Best Regards
On Thu, May 21, 2015 at 12:19 PM, Keerthi keerthi.reddy1...@gmail.com
wrote:
Hi ,
I had tried
and edit the *Path* variable to
add the *bin* directory of *HADOOP_HOME* (say *C:\hadoop\bin*). That helped
fix this issue in my env.
2015-05-21 9:55 GMT+03:00 Akhil Das ak...@sigmoidanalytics.com:
This thread happened a year back; can you please share what issue you are
facing? Which version of Spark are you using
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a
broadcast issue, then it would throw up the same error.
On 21 May 2015 14:21, allanjie allanmcgr...@gmail.com wrote:
Sure, the code is very simple. I think you guys can understand it from the main
function.
public class
-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
On Tue, May 19, 2015 at 8:10 PM, Shushant Arora
shushantaror...@gmail.com wrote:
So for Kafka+spark streaming, Receiver based streaming used
of JavaPairRDD is as expected. It is when we call the
collect() or toArray() methods that the exception comes.
It's something to do with the Text class even though I haven't used it in the
program.
Regards
Tapan
On Tue, May 19, 2015 at 6:26 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try something
the rate with
spark.streaming.kafka.maxRatePerPartition)
Read more here
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
On Wed, May 20, 2015 at 12:36 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
One receiver basically runs on 1 core, so if your single node is having
-files
Regards
Tapan
On Wed, May 20, 2015 at 12:42 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
If you can share the complete code and a sample file, maybe I can try to
reproduce it on my end.
Thanks
Best Regards
On Wed, May 20, 2015 at 7:00 AM, Tapan Sharma tapan.sha
This is more like an issue with your HDFS setup, can you check in the
datanode logs? Also try putting a new file in HDFS and see if that works.
Thanks
Best Regards
On Wed, May 20, 2015 at 11:47 AM, allanjie allanmcgr...@gmail.com wrote:
Hi All,
The variable I need to broadcast is just 468
Yes, this is the user group. Feel free to ask your questions in this list.
Thanks
Best Regards
On Wed, May 20, 2015 at 5:58 AM, Ricardo Goncalves da Silva
ricardog.si...@telefonica.com wrote:
Hi
I'm learning spark focused on data and machine learning. Migrating from
SAS.
There is a group
Hi Justin,
Can you try with sbt? Maybe that will help.
- Install sbt for windows
http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Windows.html
- Create a lib directory in your project directory
- Place these jars in it:
- spark-streaming-twitter_2.10-1.3.1.jar
-
Hi Peer,
If you open the driver UI (running on port 4040) you can see the stages and
the tasks happening inside it. The best way to identify the bottleneck for a
stage is to see if there's any time spent on GC, and how many tasks there are
per stage (it should be a number >= the total # of cores to achieve
There was some similar discussion on JIRA
https://issues.apache.org/jira/browse/SPARK-3633 maybe that will give you
some insights.
Thanks
Best Regards
On Mon, May 18, 2015 at 10:49 PM, zia_kayani zia.kay...@platalytics.com
wrote:
Hi, I'm getting this exception after shifting my code
It will be a single job running at a time by default (you can also
configure spark.streaming.concurrentJobs to run jobs in parallel, which is
not recommended in production).
Now, with your batch duration being 1 sec and processing time being 2 minutes,
if you are using a receiver based
not be started at its desired interval.
And what's the difference and usage of receiver vs non-receiver based
streaming? Is there any documentation for that?
On Tue, May 19, 2015 at 1:35 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It will be a single job running at a time by default (you can also
in result.
Deciding where to save offsets (or not) is up to you. You can checkpoint,
or store them yourself.
On Mon, May 18, 2015 at 12:00 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I have played a bit with the directStream Kafka API. Good work, Cody.
These are my findings, and also can you
specify the number of receivers that you want to spawn for
consuming the messages.
On Tue, May 19, 2015 at 2:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
spark.streaming.concurrentJobs takes an integer value, not a boolean. If
you set it to 2 then 2 jobs will run in parallel. Default value
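For example, to allow two jobs at once (only do this if the jobs really are independent of each other):
sparkConf.set("spark.streaming.concurrentJobs", "2")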
Try something like:
JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir,
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
IntWritable.class,
Text.class, new Job().getConfiguration());
With the type of input format that you require.
Thanks
Best
assure you that at least as of Spark Streaming 1.2.0,
as Evo says, Spark Streaming DOES crash in an "unceremonious way" when the free
RAM available for in-memory cached RDDs gets exhausted
*From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
*Sent:* Monday, May 18, 2015 2:03 PM
*To:* Evo Eftimov
of the condition is:
Loss was due to java.lang.Exception
java.lang.Exception: *Could not compute split, block*
*input-4-1410542878200 not found*
*From:* Evo Eftimov [mailto:evo.efti...@isecc.com]
*Sent:* Monday, May 18, 2015 12:13 PM
*To:* 'Dmitry Goldenberg'; 'Akhil Das'
*Cc:* 'user@spark.apache.org
are processed in order (and offsets committed in
order), etc.
So whoever uses whichever consumer needs to study the pros and cons of both
approaches before taking a call.
Regards,
Dibyendu
On Tue, May 12, 2015 at 8:10 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi Cody,
I was just
Streaming does “NOT” crash UNCEREMONIOUSLY – please maintain
responsible and objective communication and facts
*From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
*Sent:* Monday, May 18, 2015 2:28 PM
*To:* Evo Eftimov
*Cc:* Dmitry Goldenberg; user@spark.apache.org
*Subject:* Re: Spark
Why not use Spark Streaming to do the computation, dump the result
somewhere in a DB perhaps, and take it from there?
Thanks
Best Regards
On Mon, May 18, 2015 at 7:51 PM, juandasgandaras juandasganda...@gmail.com
wrote:
Hello,
I would like to use spark streaming over a REST api to get
Did you try the --executor-cores param? While you submit the job, do a ps aux |
grep spark-submit and see the exact command parameters.
Thanks
Best Regards
On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi,
I have a 5 nodes yarn cluster, I used spark-submit to submit
I think you can try this way also:
DataFrame df =
sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro",
"com.databricks.spark.avro");
Thanks
Best Regards
On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com wrote:
Thanks for the suggestion Steve. I'll try that out.
Why not just trigger your batch job with that event?
If you really need streaming, then you can create a custom receiver and
make the receiver sleep till the event has happened. That will obviously
run your streaming pipelines without having any data to process.
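A very rough sketch of such a receiver; eventHappened() and fetchData() are placeholders for your own logic:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class EventTriggeredReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    new Thread("event-poller") {
      override def run(): Unit = {
        while (!isStopped()) {
          if (eventHappened()) store(fetchData())   // push data only once the event occurs
          else Thread.sleep(1000)                   // otherwise just sleep
        }
      }
    }.start()
  }
  override def onStop(): Unit = {}
  private def eventHappened(): Boolean = false      // stand-in for your own check
  private def fetchData(): String = ""              // stand-in for your own data
}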
Thanks
Best Regards
On Fri, May
With file timestamps. You can actually see the finding-new-files logic
here
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172
Thanks
Best Regards
On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy
With receiver based streaming, you can actually
specify spark.streaming.blockInterval, which is the interval at which the
receiver chunks the data it receives from the source into blocks. The default value is 200ms, and hence
if your batch duration is 1 second, it will produce 5 blocks of data. And
yes, with Spark Streaming
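For example, bumping the interval to 500 ms would give 2 blocks (and hence 2 tasks per receiver) per 1-second batch; in the 1.x line the value is plain milliseconds:
sparkConf.set("spark.streaming.blockInterval", "500")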
What do you mean by not detected? Maybe you forgot to trigger some action
on the stream to get it executed. Like:
val list_join_action_stream = ssc.fileStream[LongWritable, Text,
TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString)
list_join_action_stream.count().print()
Did you happen to have a look at the Spark Job Server?
https://github.com/ooyala/spark-jobserver Someone wrote a python wrapper
https://github.com/wangqiang8511/spark_job_manager around it; give it a
try.
Thanks
Best Regards
On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW
Can you share the client code that you used to send the data? Maybe this
discussion would give you some insights
http://apache-avro.679487.n3.nabble.com/Avro-RPC-Python-to-Java-isn-t-working-for-me-td4027454.html
Thanks
Best Regards
On Thu, May 14, 2015 at 8:44 AM, 鹰 980548...@qq.com wrote:
at 1:04 PM, lisendong lisend...@163.com wrote:
I have an action on the DStream,
because when I put a text file into HDFS, it runs normally, but if I
put an lz4 file, it does nothing.
On May 14, 2015, at 3:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
What do you mean by not detected? Maybe you forgot
Have a look https://spark.apache.org/community.html
Send an email to user-unsubscr...@spark.apache.org
Thanks
Best Regards
On Thu, May 14, 2015 at 1:08 PM, Saurabh Agrawal saurabh.agra...@markit.com
wrote:
How do I unsubscribe from this mailing list please?
Thanks!!
Regards,
:
LzoTextInputFormat - where is this class?
What is the Maven dependency?
On May 14, 2015, at 3:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
That's because you are using TextInputFormat, I think; try
with LzoTextInputFormat like:
val list_join_action_stream = ssc.fileStream[LongWritable, Text
With this low-level Kafka API
https://github.com/dibbhatt/kafka-spark-consumer/, you can actually
specify how many receivers you want to spawn, and most of the time it
spawns them evenly; usually you can put a sleep just after creating the context
for the executors to connect to the driver and then
This article http://www.virdata.com/tuning-spark/ gives you a pretty good
start on the Spark Streaming side. And this article
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
is for Kafka; it has a nice explanation of how message size and
I believe fileStream would pick up the new files (maybe you should increase
the batch duration). You can see the implementation details for finding new
files here
I found two examples Java version
https://github.com/deepakkashyap/Spark-Streaming-with-RabbitMQ-/blob/master/example/Spark_project/CustomReceiver.java,
and Scala version. https://github.com/d1eg0/spark-streaming-toy
Thanks
Best Regards
On Tue, May 12, 2015 at 2:31 AM, dgoldenberg
Mesos has an HA option (of course it includes ZooKeeper)
Thanks
Best Regards
On Tue, May 12, 2015 at 4:53 PM, James King jakwebin...@gmail.com wrote:
I know that it is possible to use Zookeeper and File System (not for
production use) to achieve HA.
Are there any other options now or in the
Are you using checkpointing/WAL etc? If yes, then it could be blocking on
disk IO.
Thanks
Best Regards
On Mon, May 11, 2015 at 10:33 PM, Seyed Majid Zahedi zah...@cs.duke.edu
wrote:
Hi,
I'm running TwitterPopularTags.scala on a single node.
Everything works fine for a while (about 30min),
Yep, you can try this low-level Kafka receiver
https://github.com/dibbhatt/kafka-spark-consumer. It's much more
flexible/reliable than the one that comes with Spark.
Thanks
Best Regards
On Tue, May 12, 2015 at 5:15 PM, James King jakwebin...@gmail.com wrote:
What I want is if the driver dies for some
before that, only took it down to change
code.
http://tinypic.com/r/2e4vkht/8
Regarding flexibility, both of the APIs available in Spark will do what
James needs, as I described.
On Tue, May 12, 2015 at 8:55 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi Cody,
If you are so sure
wrote:
Very nice! will try and let you know, thanks.
On Tue, May 12, 2015 at 2:25 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yep, you can try this low-level Kafka receiver
https://github.com/dibbhatt/kafka-spark-consumer. It's much more
flexible/reliable than the one that comes with Spark
Try SparkConf.set("spark.akka.extensions", "Whatever"); underneath, I think
Spark won't ship properties that don't start with spark.* to the executors.
Thanks
Best Regards
On Mon, May 11, 2015 at 8:33 AM, Terry Hole hujie.ea...@gmail.com wrote:
Hi all,
I'd like to monitor the akka using kamon,
Did you try repartitioning? You might end up with a lot of time spent on
GC though.
Thanks
Best Regards
On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar vijaypawnar...@gmail.com
wrote:
I am using the Spark Cassandra connector to work with a table with 3
million records. Using .where() API
Have a look over here https://storm.apache.org/community.html
Thanks
Best Regards
On Sun, May 10, 2015 at 3:21 PM, anshu shukla anshushuk...@gmail.com
wrote:
http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps
--
Thanks Regards,
Anshu Shukla
Have a look at this SO
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
question,
it has discussion on various ways of accessing S3.
Thanks
Best Regards
On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote:
Hi
What's your use case and what are you trying to achieve? Maybe there's a
better way of doing it.
Thanks
Best Regards
On Fri, May 8, 2015 at 10:20 AM, Richard Alex Hofer rho...@andrew.cmu.edu
wrote:
Hi,
I'm working on a project in Spark and am trying to understand what's going
on. Right now to
Since it's loading 24 records, could it be that your CSV is corrupted? (Maybe
the newline char isn't \n but \r\n if it comes from a Windows
environment. You can check this with *cat -v yourcsvfile.csv | more*.)
Thanks
Best Regards
On Fri, May 8, 2015 at 11:23 AM, luohui20...@sina.com wrote: