I also ran into the same issue. Any updates on this?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Windowed-Operations-tp15133p23094.html
Hi, what is the method to create a DataFrame from an RDD that was saved as an
object file? I don't have a Java object, only a StructType that I want to use as
the schema for the DataFrame. How can I load the object file without the original class?
I tried retrieving it as Row:
val myrdd =
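(The snippet above is cut off.) For reference, a minimal Scala sketch of one way to do
this, assuming the object file was written from an RDD[Row], that sc and sqlContext are
in scope, and that the path and field names here are hypothetical:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

  // Hypothetical schema describing the fields that were saved
  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)))

  // Load the object file back as an RDD[Row] (path is a placeholder)
  val rowRdd = sc.objectFile[Row]("hdfs:///data/my_rows")

  // Apply the StructType as the DataFrame schema
  val ddf = sqlContext.createDataFrame(rowRdd, schema)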
This sounds like a problem that was fixed in Spark 1.3.1.
https://issues.apache.org/jira/browse/SPARK-6351
On Mon, Jun 1, 2015 at 5:44 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
This thread
Hi Antonio,
First, what version of the Spark Cassandra Connector are you using? You are
using Spark 1.3.1, which the Cassandra connector today supports in builds from
the master branch only - the release with public artifacts supporting Spark
1.3.1 is coming soon ;)
Please see
Yes, I also hit this issue, and I'd like to check whether you have fixed it or found
another solution for the same goal.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-tp466p23096.html
Hi Spark users,
The following is copied from the Spark online documentation:
http://spark.apache.org/docs/latest/job-scheduling.html.
Basically, I have two questions about it:
1. If two jobs in an application have dependencies, that is, one job depends on
the result of the other job, then I think they will
Hi,
I want to write my RDD to a Cassandra database, and I took an example from
this site:
http://www.datastax.com/dev/blog/accessing-cassandra-from-spark-in-java.
I added it to my project, but I am getting errors. Here is my project in a gist:
https://gist.github.com/yaseminn/aba86dad9a3e6d6a03dc.
Errors:
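(The error listing is cut off here.) For reference, a minimal Scala sketch of the
connector's basic write path, with hypothetical keyspace, table and column names; it
assumes spark.cassandra.connection.host is already set on the SparkConf:

  import com.datastax.spark.connector._

  // Hypothetical data and target table; adjust names to your schema
  val people = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))
  people.saveToCassandra("test_ks", "people", SomeColumns("id", "name"))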
This thread
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
has various methods for accessing S3 from Spark; it might help you.
Thanks
Best Regards
On Sun, May 24, 2015 at 8:03 AM, ogoh oke...@gmail.com wrote:
Hello,
I am
Thanks for your suggestion.
Yes, by DStream.saveAsTextFiles().
I made a mistake by using StorageLevel.NULL while overriding the
storageLevel method in my custom receiver.
When I changed it to StorageLevel.MEMORY_AND_DISK_2(), the data started being saved to
disk.
Now it's running without any
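For anyone hitting the same thing, a minimal sketch of a custom receiver where the
storage level is passed to the Receiver constructor (the payload and sleep interval are
placeholders):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
    def onStart(): Unit = {
      // Background thread that feeds received data to Spark via store()
      new Thread("my-receiver") {
        override def run(): Unit = {
          while (!isStopped()) {
            store("hello")        // hypothetical payload
            Thread.sleep(1000)
          }
        }
      }.start()
    }
    def onStop(): Unit = {}       // nothing to clean up in this sketch
  }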
#1 I'm not sure I got your point. As far as I know, -Xmx does not turn into physical
memory as soon as the process is running; the heap is first reserved as virtual memory, and
if your heap needs more, physical memory use gradually increases up to
the max heap size.
#2 Physical memory contains not only the heap, but
Maybe you can make use of the window operations
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#window-operations.
Another approach would be to keep your incoming data in an
HBase/Redis/Cassandra-style database and then, whenever you need to
average it, just query the
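As an illustration of the first suggestion, a rough Scala sketch of a windowed average
over a stream of (key, value) pairs; the socket source, field layout, and window sizes
are all hypothetical:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val ssc = new StreamingContext(new SparkConf().setAppName("windowed-avg"), Seconds(10))

  // Hypothetical source: lines like "sensorId,reading"
  val readings = ssc.socketTextStream("localhost", 9999)
    .map(_.split(","))
    .map(a => (a(0), a(1).toDouble))

  // Keep a running (sum, count) per key over the last 10 minutes, sliding every
  // minute, then divide to get the windowed average
  val averages = readings
    .mapValues(v => (v, 1L))
    .reduceByKeyAndWindow(
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
      Minutes(10), Minutes(1))
    .mapValues { case (sum, count) => sum / count }

  averages.print()
  ssc.start()
  ssc.awaitTermination()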
Here's more detailed documentation
https://github.com/datastax/spark-cassandra-connector from DataStax. You
can also shoot an email directly to their mailing list
http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
since it's more related to their code.
Thanks
Best
Hello everyone,
I have an idea and I would like to get some validation from the community about
it.
In Mahout there is an implementation of Streaming K-means. I'm
interested in your opinion: would it make sense to make a similar
implementation of Streaming K-medoids?
K-medoids has an even bigger
I haven't given any thought to streaming it, but in case it's useful I do have
a k-medoids implementation for Spark:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.cluster.KMedoids
Also a blog post about multi-threading it:
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working; all the logs are
being written. However, from the Master Web UI, the vast majority of
completed applications are labeled as not having a history:
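For reference, a minimal sketch of the event-logging settings involved (the HDFS path is
a placeholder; the same keys can also live in conf/spark-defaults.conf):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")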
Would you mind posting the code?
On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote:
Hi,
In all (PySpark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to complete,
and sometimes don't complete at all. This clearly correlates
1. Yes, if two jobs depend on each other, they can't be parallelized.
2. Imagine something like a web-application driver. You only get to have one
SparkContext, but now you want to run many concurrent jobs. They have nothing to
do with each other, so there is no reason to keep them sequential (see the sketch
just below).
Hope this helps
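To illustrate the second point, a small sketch of two independent jobs submitted
concurrently from the same SparkContext (the paths are placeholders):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  // Each future triggers its own action; the scheduler can run them in
  // parallel (FIFO by default, FAIR if spark.scheduler.mode=FAIR)
  val jobA = Future { sc.textFile("hdfs:///logs/a").count() }
  val jobB = Future { sc.textFile("hdfs:///logs/b").count() }

  val total = Await.result(jobA, Duration.Inf) + Await.result(jobB, Duration.Inf)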
Hi,
In all (PySpark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to
complete, and sometimes don't complete at all. This clearly correlates with the
size of my input data. Looking at the stage details for one such stage,
I am
I am seeing the same issue with Spark 1.3.1.
I see this issue when reading sequence file stored in Sequence File format
(SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
)
All I do is
sc.sequenceFile(dwTable, classOf[Text],
Any suggestions?
I am using Spark 1.3.1 to read a sequence file stored in SequenceFile format
(SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
)
with this code and settings
sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
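Both snippets above are cut off, so for reference here is a minimal sketch of such a
read, including one thing worth ruling out: Hadoop reuses the same Text instances across
records, so copying to immutable Strings right away avoids one common source of
seemingly corrupt data:

  import org.apache.hadoop.io.Text

  // Read the gzip-compressed SequenceFile and copy each Text to a String
  // before doing anything else with the records
  val rows = sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
    .map { case (k, v) => (k.toString, v.toString) }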
Very happy, Helena :D I'll check tomorrow morning.
A G
On 01 Jun 2015, at 19:45, Helena Edelson
helena.edel...@datastax.com wrote:
Hi Antonio,
It’s your lucky day ;) We just released Spark Cassandra Connector 1.3.0-M1
for Spark 1.3 and DataSources API
Give it a little
Hi Yana,
Not sure whether you already solved this issue. As far as I know, the DataFrame
support in Spark Cassandra connector was added in version 1.3. The first
milestone release of SCC v1.3 was just announced.
Mohammed
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Tuesday, May
Thanks, Michael and Akhil.
Yes, it worked with Spark 1.3.1 along with AWS EMR AMI 3.7.
Sorry I didn't update the status.
On Mon, Jun 1, 2015 at 5:17 AM, Michael Armbrust mich...@databricks.com wrote:
This sounds like a problem that was fixed in Spark 1.3.1.
Hi,
Is there any way to force the output RDD of a flatMap op to be stored in
both memory and disk as it is computed? My RAM would not be able to fit the
entire output of the flatMap, so it really needs to start using the disk after the
RAM gets full. I didn't find any way to force this.
Also, what
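In case it helps, a sketch of the explicit-persistence route (the input RDD and the
flatMap function are placeholders): MEMORY_AND_DISK keeps persisted partitions in memory
while they fit and spills the rest to disk.

  import org.apache.spark.storage.StorageLevel

  val expanded = input.flatMap(record => expand(record))   // placeholder names
    .persist(StorageLevel.MEMORY_AND_DISK)

  expanded.count()   // the first action materializes and caches/spills the output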
Hi Antonio,
It’s your lucky day ;) We just released Spark Cassandra Connector 1.3.0-M1 for
Spark 1.3 and DataSources API
Give it a little while to propagate to
http://search.maven.org/#search%7Cga%7C1%7Cspark-cassandra-connector
Nobody using Spark SQL JDBC/Thrift server with DSE Cassandra?
Mohammed
From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, May 29, 2015 11:49 AM
To: user@spark.apache.org
Subject: Anybody using Spark SQL JDBC server with DSE Cassandra?
Hi -
We have successfully integrated Spark
Hi Helena,
Thanks for answering me . . .
I didn't realize it could be the connector version; unfortunately, I haven't
tried that yet.
I know Scala is better, but I'm using Drools, so I'm forced to use Java.
In my project I'm using spark-cassandra-connector-java_2.10.
From Cassandra I have only this log:
INFO
Brant,
You should be able to migrate most of your existing SQL code to Spark SQL, but
remember that Spark SQL does not yet support the full ANSI standard. So you may
need to rewrite some of your existing queries.
Another thing to keep in mind is that Spark SQL is not real-time. The response
Switching to simple POJOs instead of Avro for Spark serialization
solved the problem (I mean reading Avro from S3 and then mapping each Avro
object to its serializable POJO counterpart with the same fields; the POJO is
registered with Kryo).
Any thoughts on where to look for a
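For reference, a sketch of the Kryo registration step described above; the record type is
hypothetical (a case class here, though a plain Java POJO works the same way):

  import org.apache.spark.SparkConf

  case class UserEvent(id: Long, name: String, ts: Long)   // hypothetical POJO counterpart

  val conf = new SparkConf()
    .setAppName("avro-to-pojo")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[UserEvent]))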
I would like to know what would be the best approach to randomly permute a
DataFrame. I have tried:
df.sample(false, 1.0, x).show(100)
where x is the seed. However, it gives the same result no matter the value
of x (it only gives different values when the fraction is smaller than 1.0).
I have
Ah, apologies, I found an existing issue, and a fix has already gone out for
this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036.
On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher rmarsc...@localytics.com
wrote:
It looks like it is possibly a race condition between removing the
Hi Suman Meethu,
Apologies---I was wrong about KMeans supporting an initial set of
centroids! JIRA created: https://issues.apache.org/jira/browse/SPARK-8018
If you're interested in submitting a PR, please do!
Thanks,
Joseph
On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW meethu2...@yahoo.co.in
Hello,
I posted this question a while back but am posting it again to get your
attention.
I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS YARN (tested under both
1.3.0 and 1.3.1).
My Hive table is partitioned.
I noticed that the query response time is bad depending on the number of
partitions
It looks like it is possibly a race condition between removing the
IN_PROGRESS and building the history UI for the application.
`AppClient` sends an `UnregisterApplication(appId)` message to the `Master`
actor, which triggers the process to look for the app's eventLogs. If they
are suffixed with
All -
I am facing an odd issue, and I am not really sure where to go for support
at this point. I am running MapR, which complicates things as it relates to
Mesos; however, this HAS worked in the past with no issues, so I am stumped
here.
So for starters, here is what I am trying to run. This is a
How much work is it to produce a small standalone reproduction? Can you
create an Avro file with some mock data, maybe 10 or so records, then
reproduce this locally?
On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman igor.ber...@gmail.com wrote:
switching to use simple pojos instead of using avro for
Dear all,
Does anyone know how I can force Spark to use only the disk when doing a
simple flatMap(..).groupByKey.reduce(_ + _)? Thank you!
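For what it's worth, a sketch of one way to push the persisted data to disk only
(`data` and `explode` are placeholders and the values are assumed numeric); note that
shuffle output for the groupByKey is written to local disk by Spark regardless of the
storage level:

  import org.apache.spark.storage.StorageLevel

  val expanded = data.flatMap(x => explode(x)).persist(StorageLevel.DISK_ONLY)

  // DISK_ONLY only controls where the explicitly persisted partitions live
  val sums = expanded.groupByKey().mapValues(_.sum)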
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
Hi Cesar,
try to do:
hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema)
It's a bit inefficient, but should shuffle the whole dataframe.
Thanks,
Peter Rudenko
On 2015-06-01 22:49, Cesar Flores wrote:
I would like to know what will be the best approach to randomly
Could you run the single-threaded version on a worker machine to make sure
that OpenCV is installed and configured correctly?
On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com wrote:
I've verified the issue lies within Spark running OpenCV code and not within
the sequence file
I downloaded the 1.3.1 distro tarball
$ll ../spark-1.3.1.tar.gz
-rw-r-@ 1 steve staff 8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz
However the build on it is failing with an unresolved dependency:
*configuration
not public*
$ build/sbt assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4
Hi,
What are some of the good/adopted approaches to monitoring Spark Streaming
from Kafka? I see that there are things like
http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they all
assume that receiver-based streaming is used?
Then note that one disadvantage of this approach
Interesting, only in local[*]! In the GitHub project you pointed to, what is the
main that you were running?
TD
On Mon, May 25, 2015 at 9:23 AM, rsearle eggsea...@verizon.net wrote:
Further experimentation indicates these problems only occur when master is
local[*].
There are no issues if a
Hi Deepak,
This is a notorious bug that is being tracked at
https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one source
of this bug (it turns out Snappy had a bug in buffer reuse that caused data
corruption). There are other known sources that are being addressed in
outstanding
If you can't run a patched Spark version, then you could also consider
using LZF compression instead, since that codec isn't affected by this bug.
On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote:
Hi Deepak,
This is a notorious bug that is being tracked at
KafkaCluster.scala in the spark/external/kafka project has a bunch of API
code, including code for updating Kafka-managed ZK offsets. Look at
setConsumerOffsets.
Unfortunately all of that code is private, but you can either write your
own, copy it, or do what I do (sed out private[spark] and
In the receiver-less direct approach, there is no concept of a consumer
group, as we don't use the Kafka high-level consumer (which uses ZK). Instead,
Spark Streaming manages offsets on its own, giving tighter guarantees. If
you want to monitor the progress of the processing of offsets, you will
have to
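(The message is cut off here.) For reference, one common way to expose those offsets
with the direct approach is a sketch like the following, where directKafkaStream is
assumed to come from KafkaUtils.createDirectStream:

  import org.apache.spark.streaming.kafka.HasOffsetRanges

  directKafkaStream.foreachRDD { rdd =>
    // Must be the first operation on the RDD so the cast succeeds
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetRanges.foreach { o =>
      println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
    }
  }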
It would be nice to see the code for MapR FS Java API, but my google foo
failed me (assuming it's open source)...
So, shooting in the dark ;) there are a few things I would check, if you
haven't already:
1. Could there be 1.2 versions of some Spark jars that get picked up at run
time (but
I think you can use SPM - http://sematext.com/spm - it will give you all
Spark and all Kafka metrics, including offsets broken down by topic, etc.
out of the box. I see more and more people using it to monitor various
components in data processing pipelines, a la
Thank you, Tathagata, Cody, Otis.
- Dmitry
On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:
I think you can use SPM - http://sematext.com/spm - it will give you all
Spark and all Kafka metrics, including offsets broken down by topic, etc.
out of the box.
Hello All,
A bit scared I did something stupid... I killed a few PIDs that were
listening on ports 2183 (Kafka) and 4042 (Spark app). Some of the PIDs
didn't even seem to be stopped, as they are still running when I do
lsof -i:[port number]
I'm not sure if the problem started after or before I did
Does this build Spark for hadoop version 2.6.0?
build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
package
Thanks!
Hello Josh,
Are you suggesting storing the source data with LZF compression and using the
same Spark code as is?
Currently it's stored in SequenceFile format and compressed with GZIP.
First line of the data:
(SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'
I just ran the same app with limited data on my personal machine - no error.
Seems to be a Mesos issue. Will investigate further. If anyone knows
anything, let me know :)
The second one sounds reasonable, I think.
On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Hi everyone,
Let's assume I have a complex workflow of more than 10 datasources as input
- 20 computations (some creating intermediary datasets and some merging
No, all of the RDDs (including those returned from randomSplit()) are read-only.
On Mon, Apr 27, 2015 at 11:28 AM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
Suppose I have something like the code below
for idx in xrange(0, 10):
train_test_split =
When I start the Spark master process, the old records are not shown in
the monitoring UI.
How can I show the old records? Thank you very much!
Looks good.
-Dhadoop.version is not needed because the profile already defines it.
<profile>
  <id>hadoop-2.6</id>
  <properties>
    <hadoop.version>2.6.0</hadoop.version>
On Mon, Jun 1, 2015 at 5:51 PM, Mulugeta Mammo mulugeta.abe...@gmail.com
wrote:
Does this build Spark for hadoop