Thanks for the answer, but this seems to apply to files that have a
key-value structure, which I currently don't have. My file is a generic
binary file encoding data from sensors over time. I am just looking at
recreating some objects by assigning splits (i.e. continuous chunks of bytes)
to each
The question is really whether all the third-party integrations should
be built into Spark's main assembly. I think reasonable people could
disagree, but I think the current state (not built in) is reasonable.
It means you have to bring the integration with you.
That is, no, third-party queue
This article http://www.virdata.com/tuning-spark/ gives you a pretty good
start on the Spark streaming side. And this article
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
is for Kafka; it has a nice explanation of how message size and
Thanks for explaining Sean and Cody, this makes sense now. I'd like to
help improve this documentation so other python users don't run into the
same thing, so I'll look into that today.
On Tue, May 12, 2015 at 9:44 AM Cody Koeninger c...@koeninger.org wrote:
One of the packages just contains
Great! Reza
On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey rbol...@gmail.com wrote:
Hi Reza,
That was the fix we needed. After sorting, the transposed entries are gone!
Thanks a bunch,
rick
On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh r...@databricks.com wrote:
Hi Richard,
One reason
I think Java-land users will understand to look for an assembly jar in
general, but it's not as obvious outside the Java ecosystem. Assembly
= this thing, plus all its transitive dependencies.
No, there is nothing wrong with Kafka at all. You need to bring
everything it needs for it to work at
@Vadim What happened when you tried unioning using DStream.union in python?
TD
On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote:
I can confirm it does work in Java
*From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
*Sent:* Tuesday, May 12, 2015 5:53 PM
It's the import statement Olivier showed that makes the method available.
Note that you can also use `sqlContext.createDataFrame(myRDD)`, without the need
for the import statement. I personally prefer this approach.
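For illustration, a minimal sketch of the two routes (assuming an existing
SparkContext `sc`; the case class and RDD names are made up):

import org.apache.spark.sql.SQLContext
case class Record(id: Int, name: String)

val sqlContext = new SQLContext(sc)
val myRDD = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

// Route 1: implicit conversion to a DataFrame, enabled by the import
import sqlContext.implicits._
val df1 = myRDD.toDF()

// Route 2: explicit call, no implicits import needed
val df2 = sqlContext.createDataFrame(myRDD)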
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
@TD I kept getting an empty RDD (i.e. rdd.take(1) was False).
On Tue, May 12, 2015 at 12:57 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
@Vadim What happened when you tried unioning using DStream.union in python?
TD
On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov
Hi all,
Just to let you know that we've released a new version of our (Scala) Spark
Example Project. It now targets Spark 1.3.0 and has much cleaner Elastic
MapReduce support using boto/invoke:
http://snowplowanalytics.com/blog/2015/05/10/spark-example-project-0.3.0-released/
Hope it's of
I can confirm it does work in Java
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Tuesday, May 12, 2015 5:53 PM
To: Evo Eftimov
Cc: Saisai Shao; user@spark.apache.org
Subject: Re: DStream Union vs. StreamingContext Union
Thanks Evo. I tried chaining Dstream unions like
Thanks again for all the help folks.
I can confirm that simply switching to `--packages
org.apache.spark:spark-streaming-kafka-assembly_2.10:1.3.1` makes
everything work as intended.
I'm not sure what the difference is between the two packages honestly, or
why one should be used over the other,
Hi Olivier,
Here it is.
== Physical Plan ==
Aggregate false, [PartialGroup#155], [PartialGroup#155 AS
is_bad#108,Coalesce(SUM(PartialCount#152L),0) AS
count#109L,(CAST(SUM(PartialSum#153), DoubleType) /
CAST(SUM(PartialCount#154L), DoubleType)) AS avg#110]
Exchange (HashPartitioning
val trainRDD = rawTrainData.map( rawRow => Row( rawRow.split(",")
.map(_.toInt) ) )
The above is creating a Row with a single column that contains a sequence.
You need to extract the sequence using varargs:
val trainRDD = rawTrainData.map( rawRow => Row( rawRow.split(",")
.map(_.toInt): _* ))
You
Hi folks, I'm trying to use automatic partition discovery as described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
/data/year=2014/file.parquet
/data/year=2015/file.parquet
…
SELECT * FROM table WHERE year = 2015
I have an official 1.3.1 CDH4
Hello,
I tried running GraphX Pregel for single-source shortest path (very similar
to example in
https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api)
using around 17K vertices and 36K edges. On a simple 8 vertex, 10 edge graph
the Pregel algorithm works very well. When I
Thanks Saisai. That makes sense. Just seems redundant to have both.
On Mon, May 11, 2015 at 10:36 PM, Saisai Shao sai.sai.s...@gmail.com
wrote:
DStream.union can only union two DStreams, one of which is itself, while
StreamingContext.union can union an array of DStreams; internally
DStream.union is a
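For reference, a minimal Scala sketch contrasting the two APIs (the stream
sources and host names are made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("UnionSketch")
val ssc = new StreamingContext(conf, Seconds(5))

val s1 = ssc.socketTextStream("host1", 9999)
val s2 = ssc.socketTextStream("host2", 9999)
val s3 = ssc.socketTextStream("host3", 9999)

// DStream.union joins two streams at a time, so chaining is needed
val chained = s1.union(s2).union(s3)

// StreamingContext.union takes a whole sequence in one call
val all = ssc.union(Seq(s1, s2, s3))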
Thanks Evo. I tried chaining DStream unions like what you have and it
didn't work for me. But passing
multiple arguments to StreamingContext.union worked fine. Any idea why? I
am using Python, BTW.
On Tue, May 12, 2015 at 12:45 PM, Evo Eftimov evo.efti...@isecc.com wrote:
You can also union
I wonder if that may be a bug in the Python API. Please file it as a JIRA
along with sample code to reproduce it and sample output you get.
On Tue, May 12, 2015 at 10:00 AM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
@TD I kept getting an empty RDD (i.e. rdd.take(1) was False).
On
In PySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will
produce the same shuffle.
I am looking for a way to set this same random seed once on each worker. Is
there any simple way to do it?
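One way to get deterministic per-partition seeding, sketched here in Scala
(the RDD contents and base seed are hypothetical); the same pattern works with
mapPartitionsWithIndex in PySpark:

import scala.util.Random

val rdd = sc.parallelize(Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8)), 2)
val baseSeed = 42L  // fixed base seed, chosen for the test run

val shuffled = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // derive the seed from the partition index so every run shuffles identically
  val rng = new Random(baseSeed + idx)
  iter.map(record => rng.shuffle(record))  // each record assumed to be a Seq
}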
Assume that I had several machines with 8 cores each. Which is better: 1 core
per worker with 8 workers, or 8 cores per worker with 1 worker?
Hi there,
Which version are you using? Actually, the problem seems to be gone after we
changed our Spark version from 1.2.0 to 1.3.0.
Not sure what the internal changes did.
Best,
Sun.
fightf...@163.com
From: Night Wolf
Date: 2015-05-12 22:05
To: fightf...@163.com
CC: Patrick Wendell; user; dev
Dear list,
I am new to Spark, and I want to use the k-means algorithm in the MLlib package.
I am wondering whether it is possible to customize the distance measure used
by k-means, and how?
Many thanks!
June
I believe fileStream would pick up the new files (maybe you should increase
the batch duration). You can see the implementation details for finding new
files from here
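A minimal sketch (Scala) of watching a directory for new files; the path and
batch duration here are made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamSketch")
val ssc = new StreamingContext(conf, Seconds(60))  // longer batch duration

// Files newly added to the monitored directory are picked up on each batch
val lines = ssc.textFileStream("hdfs:///incoming/data")
lines.count().print()

ssc.start()
ssc.awaitTermination()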
I don't know how to simulate such a type of input, even for Spark.
On Tue, May 12, 2015 at 3:02 PM, Steve Loughran ste...@hortonworks.com
wrote:
I think you may want to try emailing things to the storm users list, not
the spark one
On 11 May 2015, at 15:42, Tyler Mitchell
Thanks Akhil,
I'm using Spark in standalone mode, so I guess Mesos is not an option here.
On Tue, May 12, 2015 at 1:27 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Mesos has a HA option (of course it includes zookeeper)
Thanks
Best Regards
On Tue, May 12, 2015 at 4:53 PM, James King
I found two examples: a Java version
https://github.com/deepakkashyap/Spark-Streaming-with-RabbitMQ-/blob/master/example/Spark_project/CustomReceiver.java,
and a Scala version: https://github.com/d1eg0/spark-streaming-toy
Thanks
Best Regards
On Tue, May 12, 2015 at 2:31 AM, dgoldenberg
Hi,
I have a SQL query on tables containing big Map columns (thousands of
keys). I found it to be very slow.
select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as
avg
from test
where date between '2014-04-01' and '2014-04-30'
group by meta['is_bad']
=
Hi User Group,
I’m trying to reproduce the example in the Spark SQL Programming Guide
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
and got a compile error when packaging with sbt:
[error] myfile.scala:30: value toDF is not a member of
I've been querying ZooKeeper directly via the ZooKeeper client tools; it has
the IP of the current master leader in the master_status data. We are also
running Exhibitor for ZooKeeper, which has a nice UI for exploring if you want
to look it up manually.
Thanks,
Michal
On May 12, 2015, at 1:28
Mesos has a HA option (of course it includes zookeeper)
Thanks
Best Regards
On Tue, May 12, 2015 at 4:53 PM, James King jakwebin...@gmail.com wrote:
I know that it is possible to use Zookeeper and File System (not for
production use) to achieve HA.
Are there any other options now or in the
Are you using checkpointing/WAL etc? If yes, then it could be blocking on
disk IO.
Thanks
Best Regards
On Mon, May 11, 2015 at 10:33 PM, Seyed Majid Zahedi zah...@cs.duke.edu
wrote:
Hi,
I'm running TwitterPopularTags.scala on a single node.
Everything works fine for a while (about 30min),
Hi,
In Spark on YARN, when running spark_shuffle as an auxiliary service on the
node manager, do the map spills of a stage get cleaned up once the next
stage completes, OR
are they preserved until the app completes (i.e. until all the stages
complete)?
--
Thanks,
Ashwin
Hi,
I tested the following in my streaming app and hoped to get an approximate
count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed
to return only after it finished completely, just like rdd.count(), which
often exceeded 5 seconds. The values for low, mean, and high
Dear all,
We’re organizing a meetup http://www.meetup.com/Tachyon/events/222485713/ on
May 28th at IBM in Foster City that might be of interest to the Spark
community. The focus is a production use case of Spark and Tachyon at Baidu.
You can sign up here:
Hi, is content-based filtering available for Spark in MLlib? If it isn't,
what can I use as an alternative? Thank you.
Have a nice day
yasemin
--
hiç ender hiç
Thanks! We can somewhat approximate the number of rows returned by where();
as a result we can approximate the number of partitions, so the repartition
approach will work.
Let's say the .where() had resulted in a widely varying number of rows; then we
would not have been able to approximate the # of partitions, and that would
What I want is if the driver dies for some reason and it is restarted I
want to read only messages that arrived into Kafka following the restart of
the driver program and re-connection to Kafka.
Has anyone done this? any links or resources that can help explain this?
Regards
jk
Content-based filtering is a pretty broad term - do you have any particular
approach in mind?
MLlib does not have any purely content-based methods. Your main alternative is
ALS collaborative filtering.
However, using a system like Oryx / PredictionIO / Elasticsearch etc. you can
combine
Yep, you can try this low-level Kafka receiver
https://github.com/dibbhatt/kafka-spark-consumer. It's much more
flexible/reliable than the one that comes with Spark.
Thanks
Best Regards
On Tue, May 12, 2015 at 5:15 PM, James King jakwebin...@gmail.com wrote:
What I want is if the driver dies for some
Very nice! will try and let you know, thanks.
On Tue, May 12, 2015 at 2:25 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yep, you can try this low-level Kafka receiver
https://github.com/dibbhatt/kafka-spark-consumer. It's much more
flexible/reliable than the one that comes with Spark.
Thanks
Thanks, Akhil. It looks like in the second example, for Rabbit they're
doing this: https://www.rabbitmq.com/mqtt.html.
On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I found two examples Java version
From the code, it seems that as soon as rdd.countApprox(5000)
returns, you can call pResult.initialValue() to get the approximate count
at that point in time (that is, after the timeout). Calling
pResult.getFinalValue() will block further until the job is over, and
give the final correct values
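A small sketch (Scala) of that usage, assuming an existing RDD `rdd`:

val pResult = rdd.countApprox(5000, 0.95)

// Available right after the timeout: an approximate BoundedDouble
val approx = pResult.initialValue
println(s"approx count: mean=${approx.mean}, low=${approx.low}, high=${approx.high}")

// Blocks until the whole job finishes, then returns the exact result
val exact = pResult.getFinalValue()
println(s"final count: ${exact.mean}")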
I'm seeing a similar thing with a slightly different stack trace. Ideas?
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
Akhil, I hope I'm misreading the tone of this. If you have personal issues
at stake, please take them up outside of the public list. If you have
actual factual concerns about the kafka integration, please share them in a
jira.
Regarding reliability, here's a screenshot of a current production
Many thanks both, appreciate the help.
On Tue, May 12, 2015 at 4:18 PM, Cody Koeninger c...@koeninger.org wrote:
Yes, that's what happens by default.
If you want to be super accurate about it, you can also specify the exact
starting offsets for every topic/partition.
On Tue, May 12, 2015
Can you post the explain output too?
On Tue, May 12, 2015 at 12:11 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I have a SQL query on tables containing big Map columns (thousands of
keys). I found it to be very slow.
select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as
Hi Cody,
I was just saying that I found more success and higher throughput with the
low-level Kafka API prior to KafkaRDDs, which seem to be the future. My
apologies if you felt it that way. :)
On 12 May 2015 19:47, Cody Koeninger c...@koeninger.org wrote:
Akhil, I hope I'm misreading the tone of
I think you may want to try emailing things to the storm users list, not the
spark one
On 11 May 2015, at 15:42, Tyler Mitchell
tyler.mitch...@actian.com wrote:
I've had good success with splunk generator.
https://github.com/coccyx/eventgen/blob/master/README.md
I know that it is possible to use Zookeeper and File System (not for
production use) to achieve HA.
Are there any other options now or in the near future?
Seeing similar issues, did you find a solution? One would be to increase
the number of partitions if you're doing lots of object creation.
On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com fightf...@163.com
wrote:
Hi, patrick
Really glad to get your reply.
Yes, we are doing group by
Yes, the SparkContext in the Python API has a reference to the
JavaSparkContext (jsc)
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
through which you can access the Hadoop configuration.
On Tue, May 12, 2015 at 6:39 AM, ayan guha guha.a...@gmail.com wrote:
Hi
Yes, that's what happens by default.
If you want to be super accurate about it, you can also specify the exact
starting offsets for every topic/partition.
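A minimal sketch (Scala, Spark 1.3+, with spark-streaming-kafka on the
classpath) of both modes; the broker list, topic, and offsets are made up, and
`ssc` is assumed to be an existing StreamingContext:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// Default: each partition starts from its latest offset (auto.offset.reset unset)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))

// Exact: resume from explicit per-partition offsets saved by a previous run
val fromOffsets = Map(
  TopicAndPartition("mytopic", 0) -> 11L,
  TopicAndPartition("mytopic", 1) -> 42L)
val resumed = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))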
On Tue, May 12, 2015 at 9:01 AM, James King jakwebin...@gmail.com wrote:
Thanks Cody.
Here are the events:
- Spark app connects to
you need to instantiate a SQLContext:
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
On Tue, May 12, 2015 at 12:29 PM, SLiZn Liu sliznmail...@gmail.com wrote:
I added `libraryDependencies += "org.apache.spark" % "spark-sql_2.11" %
"1.3.1"` to `build.sbt`
Hi Reza,
That was the fix we needed. After sorting, the transposed entries are gone!
Thanks a bunch,
rick
On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh r...@databricks.com wrote:
Hi Richard,
One reason that could be happening is that the rows of your matrix are
using SparseVectors, but the
Yeah, fair point about Python.
spark-streaming-kafka should not contain third-party dependencies.
However there's nothing stopping the build from producing an assembly
jar from these modules. I think there is an assembly target already
though?
On Tue, May 12, 2015 at 3:37 PM, Lee McFadden
The low-level consumer which Akhil mentioned has been running in Pearson
for the last 4-5 months without any downtime. I think this one is the reliable
receiver-based Kafka consumer for Spark as of today, if you say it that
way.
Prior to Spark 1.3, other receiver-based consumers have used Kafka
Thanks folks, really appreciate all your replies! I tried each of your
suggestions and in particular, *Animesh*'s second suggestion of *making the
case class definition global* helped me get out of the trap.
Plus, I should have pasted my entire code with this mail to help the
diagnosis.
REGARDS,
Todd
This matrix is in the format of a document-term matrix. Each row represents all
the words in a single document, each column represents just one of the
possible words, and the elements of the matrix are the corresponding word
counts.
Simple example here
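As a purely illustrative sketch (Scala; the two documents, the three-word
vocabulary, and the use of RowMatrix are assumptions, with `sc` an existing
SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Vocabulary: ["spark", "kafka", "stream"]
// doc0 = "spark spark stream", doc1 = "kafka"
val rows = sc.parallelize(Seq(
  Vectors.dense(2.0, 0.0, 1.0),   // word counts for doc0
  Vectors.dense(0.0, 1.0, 0.0)))  // word counts for doc1

val dtm = new RowMatrix(rows)     // one row per document, one column per word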
I don't think it's accurate for Akhil to claim that the linked library is
much more flexible/reliable than what's available in Spark at this point.
James, what you're describing is the default behavior for the
createDirectStream API available as part of Spark since 1.3. The Kafka
parameter
Hi Cody,
If you are so sure, can you share a benchmark (which you ran for days
maybe?) that you have done with the Kafka APIs provided by Spark?
Thanks
Best Regards
On Tue, May 12, 2015 at 7:22 PM, Cody Koeninger c...@koeninger.org wrote:
I don't think it's accurate for Akhil to claim that
Hi
I found this method in the Scala API but not in the Python API (1.3.1).
Basically, I want to change the block size in order to read a binary file using
sc.binaryRecords but with multiple partitions (for testing I want to
generate partitions smaller than the default block size).
Is it possible in Python? If
Thanks Cody.
Here are the events:
- Spark app connects to Kafka first time and starts consuming
- Messages 1 - 10 arrive at Kafka then Spark app gets them
- Now driver dies
- Messages 11 - 15 arrive at Kafka
- Spark driver program reconnects
- Then Messages 16 - 20 arrive Kafka
What I want is
Hi Felix and Tomoas,
Thanks a lot for your information. I figured out the environment variable
PYSPARK_PYTHON is the secret key.
My current approach is to start iPython notebook on the namenode,
export PYSPARK_PYTHON=/opt/local/anaconda/bin/ipython
/opt/local/anaconda/bin/ipython notebook
Currently external/kafka/pom.xml doesn't cite yammer metrics as dependency.
$ ls -l
~/.m2/repository/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar
-rw-r--r-- 1 tyu staff 82123 Dec 17 2013
/Users/tyu/.m2/repository/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar
It could also be that your hash function is expensive. What is the key class
you have for the reduceByKey / groupByKey?
Matei
On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote:
I'm seeing a similar thing with a slightly different stack trace. Ideas?
Hi,
I'm looking at a data ingestion implementation which streams data out of
Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
process the data in each partition. Have folks looked at ways of speeding
up this type of ingestion?
Let's say the main part of the ingest
I would be cautious regarding use of spark.cleaner.ttl, as it can lead to
confusing error messages if time-based cleaning deletes resources that are
still needed. See my comment at
bq. it is already in the assembly
Yes. Verified:
$ jar tvf ~/Downloads/spark-streaming-kafka-assembly_2.10-1.3.1.jar |
grep yammer | grep Gauge
1329 Sat Apr 11 04:25:50 PDT 2015 com/yammer/metrics/core/Gauge.class
On Tue, May 12, 2015 at 8:05 AM, Sean Owen so...@cloudera.com wrote:
It