I am calling spark-submit passing maxRate; I have a single Kinesis receiver
and batches of 1s:
spark-submit --conf spark.streaming.receiver.maxRate=10
However, a single batch can greatly exceed the established maxRate, e.g. I'm
getting 300 records.
Am I missing any setting?
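For reference, a minimal sketch of the setup being described (the app name and skeleton are assumptions): with maxRate=10 records/sec per receiver and 1-second batches, each batch should cap at roughly 10 records.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical skeleton: one receiver, 1s batches, receiver capped at 10 records/sec.
val conf = new SparkConf()
  .setAppName("kinesis-maxrate-test")
  .set("spark.streaming.receiver.maxRate", "10")
val ssc = new StreamingContext(conf, Seconds(1))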
Bcc dev@ and add user@
The dev list is not meant for users to ask questions on how to use Spark.
For that you should use StackOverflow or the user@ list.
scala> sql("select 1 & 2").show()
+---+
|(1 & 2)|
+---+
| 0|
+---+
scala> sql("select 1 & 3").show()
+---+
|(1 & 3)|
On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran
wrote:
>
> irrespective of naming, know that deep directory trees are performance
> killers when listing files on s3 and setting up jobs. You might actually be
> better off having them in the same directory and using a
Hi all,
How does the shuffle in Spark 1.6.2 work? I am using groupByKey(partitionSize: Int).
groupByKey, a shuffle operation, has a mapper side (M mappers) and a reducer
side (R reducers).
Here R = partitionSize, and each mapper will produce a local file output and
store it in spark.local.dir. Let's assume total
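A minimal sketch of the operation described above (the numbers are illustrative; assumes a SparkContext `sc`, e.g. from spark-shell): M map tasks write shuffle output locally under spark.local.dir, and R = partitionSize reduce tasks fetch it.

val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i)) // map side
val grouped = pairs.groupByKey(8) // shuffle into R = 8 partitions
grouped.count() // reduce side materializes the groups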
Hello, I am trying to test Spark's SQL window functions from the following blog post,
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html,
and am facing a problem, as follows:

# testing rowsBetween()
winSpec2 =
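For reference, a hedged Scala sketch of the rowsBetween example from that post (the DataFrame df and the column names are assumptions; Long.MinValue to 0 is the 1.x way of saying "unbounded preceding to current row"):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val winSpec2 = Window.partitionBy("category")
  .orderBy("revenue")
  .rowsBetween(Long.MinValue, 0)
df.withColumn("running_total", sum("revenue").over(winSpec2)) // running total per category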
You could open up a JIRA to add a version of from_json that supports schema
inference, but unfortunately that would not be super easy to implement. In
particular, it would introduce a weird case where only this specific
function would block for a long time while we infer the schema (instead of
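For contrast, a sketch of the existing from_json with an explicit schema, which is what avoids the blocking inference step (column and field names are illustrative; df is assumed):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)
df.select(from_json(col("payload"), schema).alias("parsed"))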
Hello All,
I just want to make sure this is the right use case for Kafka --> Spark
Streaming.
A few words about my use case:
When the user watches a video, I get position events from the user that
indicate how much they have completed viewing, and at a certain point I
mark that Video as
Hi,
To disable the tracker-side WAL, leave the checkpoint dir unset (it gets set
via streamingContext.checkpoint()).
http://spark.apache.org/docs/latest/streaming-programming-guide.html#how-to-configure-checkpointing
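A minimal sketch of the relevant call (app name and path are illustrative): the tracker-side log is written under the checkpoint dir, so it is only active when one is set.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("wal-test"), Seconds(1))
ssc.checkpoint("hdfs:///checkpoints/myapp") // omit this call to leave the dir unset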
// maropu
On Tue, Nov 29, 2016 at 9:04 AM, Tim Harsch wrote:
> Hi all,
Hi all,
I set `spark.streaming.receiver.writeAheadLog.enable=false` and my history
server confirms the property has been set. Yet, I continue to see the error:
16/11/28 15:47:04 ERROR util.FileBasedWriteAheadLog_ReceivedBlockTracker:
Failed to write to write ahead log after 3 failures
I
In this case, persisting to Cassandra is for future analytics and
visualization.
I want to notify the app of the event, to make the app interactive.
Thanks
On Mon, Nov 28, 2016 at 2:24 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:
> Sorry I don't understand...
> Is
Sorry, I don't understand...
Is it a Cassandra acknowledgement to the actors that you want? Why do you want
to ack after writing to cassandra? Your pipeline kafka=>spark=>cassandra is
supposed to be exactly-once, so you don't need to wait for the cassandra ack;
you can just write to kafka from the actors and
Hi All,
Files like the one below are quickly filling up the disk. I am using a
standalone cluster, so what setting do I need to change to turn these into
rolling logs or something, to avoid filling up the disk?
spark/work/app-20161128185548/1/stderr
Thanks,
kant
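A hedged sketch of one way to roll executor logs on a standalone cluster, set on each worker in conf/spark-env.sh (the property names are from the Spark configuration docs; the values are illustrative):

SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=size \
  -Dspark.executor.logs.rolling.maxSize=134217728 \
  -Dspark.executor.logs.rolling.maxRetainedFiles=5"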
Thanks Vincent for the input. Not sure I understand your suggestion. Please
clarify.
A few words about my use case:
When the user watches a video, I get position events from the user that
indicate how much they have completed viewing, and at a certain point I
mark that Video as complete and
Uhm, this link
https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
seems to indicate you can do it.
hth
On Mon, Nov 28, 2016 at 9:55 PM, Didac Gil wrote:
> Any suggestions for using something like OneHotEncoder and
Any suggestions for using something like OneHotEncoder and StringIndexer on
an InputDStream?
I could try to build an indexer from a static parquet file, but I want to
use the OneHotEncoder approach on streaming data coming from a socket,
something like the sketch below.
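A hedged sketch of that static-indexer idea (Spark 1.6-era ML API; staticDF, the socket DStream `stream`, sqlContext, and the column names are all assumptions):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(staticDF) // fitted once, on the static data

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

stream.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF("category") // assumes a DStream[String]
  val encoded = encoder.transform(indexerModel.transform(df))
  // ... use encoded ...
}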
Thanks!
Dídac Gil de la Iglesia
Hi Andrew,
sorry, but to me it seems s3 is the culprit.
I have downloaded your json file and stored it locally. Then I wrote this simple
app (a subset of what you have in your github; sorry, I'm a little bit rusty on
how to create a new column out of existing ones) which basically reads the
json file
It's in
Can we get a bot that auto-subscribes folks to cat-facts! when they email
the user list instead of user-unsubscr...@spark.apache.org?
On Mon, Nov 28, 2016 at 1:28 PM R. Revert wrote:
> Unsubscribe
>
> On Nov 28, 2016, 5:22 p.m., wrote:
>
>
Unsubscribe
On Nov 28, 2016, 5:22 p.m., wrote:
> Unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
You don't need actors to do kafka=>spark processing=>kafka.
Why do you need to notify the akka producer? If you need to get back the
processed message in your producer, then implement an akka consumer in
your akka app, and kafka offsets will do the job.
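A rough sketch of that loop (the topic name and producerProps are assumptions; processedStream is the output DStream): Spark writes the processed events back to a Kafka topic, and the akka app just consumes that topic.

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

processedStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val producer = new KafkaProducer[String, String](producerProps) // props assumed
    records.foreach(r => producer.send(new ProducerRecord[String, String]("events-out", r)))
    producer.close()
  }
}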
2016-11-28 21:46 GMT+01:00 shyla deshpande
Thanks Daniel for the response.
I am planning to use Spark Streaming to do event processing. I will have
akka actors sending messages to kafka. I process them using Spark Streaming,
and as a result new events will be generated. How do I notify the akka
actor (the message producer) that a new event
They should take the same time if everything else is constant.
On 28 Nov 2016 23:41, "Hitesh Goyal" wrote:
> Hi team, I am using spark SQL for accessing the amazon S3 bucket data.
>
> If I run a sql query by using normal SQL syntax like below
>
> 1) DataFrame
I just stumbled upon this issue as well in Spark 1.6.2 when trying to write
my own custom Sink. For anyone else who runs into this issue, there are
two relevant JIRAs that I found, but no solution as of yet:
- https://issues.apache.org/jira/browse/SPARK-14151 - Propose to refactor
and expose
Well, I would say it depends on what you're trying to achieve. Right now I
don't know why you are considering using Akka. Could you please explain
your use case a bit?
In general, there is no single correct answer to your current question as
it's quite broad.
Daniel
On Mon, Nov 28, 2016 at 9:11
Anyone with experience running Spark Streaming in production, I'd appreciate
your input.
Thanks
-shyla
On Mon, Nov 28, 2016 at 12:11 AM, shyla deshpande
wrote:
> My data pipeline is Kafka --> Spark Streaming --> Cassandra.
>
> Can someone please explain to me when I would need to
Thanks, I will look into the classpaths and check.
On Mon, Nov 21, 2016 at 3:28 PM, Jakob Odersky wrote:
> The issue I was having had to do with missing classpath settings; in
> sbt it can be solved by setting `fork:=true` to run tests in new jvms
> with appropriate
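For reference, a minimal build.sbt sketch of that suggestion (the javaOptions line is an illustrative extra):

fork := true                    // run tests in a forked JVM with a proper classpath
javaOptions in Test += "-Xmx2g" // example of a per-fork JVM setting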
Would you please provide a simple code snippet demonstrating the
problem and also the error message you're receiving?
On Mon, Nov 28, 2016 at 12:12 AM, Hitesh Goyal
wrote:
> I tried this, but it is throwing an error that the method "when" is not
> applicable.
> I am
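In the meantime, a guess at the intended usage (df and the column names are illustrative): `when` is the static org.apache.spark.sql.functions.when, not a method available on a bare Column, which is a common source of that error.

import org.apache.spark.sql.functions.{col, when}

df.withColumn("size", when(col("amount") > 100, "big").otherwise("small"))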
If your query plan has "Project" in it, there is a bug in Spark preventing
the "broadcast" hint from working in pre-2.0 releases.
https://issues.apache.org/jira/browse/SPARK-13383
Unfortunately, the fix was not backported to 1.x.
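For reference, the hint in question (DataFrame names illustrative):

import org.apache.spark.sql.functions.broadcast

largeDF.join(broadcast(smallDF), "key") // marks smallDF for broadcast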
Yong
From: Anton Okolnychyi
I extracted out the boto bits and tested in vanilla python on the nodes. I
am pretty sure that the data from S3 is ok. I've applied a public policy to
the bucket s3://time-waits-for-no-man. There is a publicly available object
here:
Try limiting the partitions via spark.sql.shuffle.partitions.
This controls the number of files generated.
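A hedged sketch (the value and output path are illustrative; coalesce is an alternative way to cut the file count):

sqlContext.setConf("spark.sql.shuffle.partitions", "8")
df.coalesce(8).write.parquet("s3://bucket/output") // df assumed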
On 28 Nov 2016 8:29 p.m., "Kevin Tran" wrote:
> Hi Denny,
> Thank you for your inputs. I also use 128 MB but still too many files
> generated by Spark app which is only ~14 KB
Hello,
In my project, I would like to use logback as the logging framework (faster,
smaller memory footprint, etc.).
I have managed to make it work; however, I had to modify the spark jars folder:
- remove slf4j-log4jxx.jar
- add logback-classic / logback-core.jar
And add logback.xml in the conf folder.
Is it
Hi team, I am using Spark SQL for accessing the Amazon S3 bucket data.
If I run a sql query using normal SQL syntax like below:
1) DataFrame d = sqlContext.sql("SELECT * FROM tablename WHERE column_condition");
Secondly, if I use dataframe functions for the same query like below:
2)
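For comparison, a sketch of the two forms (table name, column, and predicate are illustrative); both produce the same logical plan, which is why they should perform alike:

import org.apache.spark.sql.functions.col

val viaSql = sqlContext.sql("SELECT * FROM tablename WHERE column > 10")
val viaApi = sqlContext.table("tablename").filter(col("column") > 10)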
Hi Denny,
Thank you for your inputs. I also use 128 MB, but still too many files are
generated by the Spark app, each only ~14 KB! That's why I'm asking whether
there is a solution, in case someone has had the same issue.
Cheers,
Kevin.
On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee
irrespective of naming, know that deep directory trees are performance killers
when listing files on s3 and setting up jobs. You might actually be better off
having them in the same directory and using a pattern like 2016-03-11-*
as the pattern to find files.
On 28 Nov 2016, at 04:18,
Found the exact issue. If the vertex attribute is a complex object containing
mutable state, the edge triplet does not pick up the new state once the vertex
attributes have already been shipped, but if the vertex attributes are immutable
objects there is no issue. Below is code for the same. Just
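Since the code was cut off here, a hedged reconstruction of the immutable-attribute case (assumes a SparkContext `sc`, e.g. from spark-shell):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class State(value: Int) // immutable vertex attribute

val vertices: RDD[(VertexId, State)] =
  sc.parallelize(Seq((1L, State(0)), (2L, State(0))))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1)))

val graph = Graph(vertices, edges)
// updating via copy creates a new object, so triplets see the new state
val updated = graph.mapVertices { case (_, s) => s.copy(value = s.value + 1) }
updated.triplets.collect()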
My data pipeline is Kafka --> Spark Streaming --> Cassandra.
Can someone please explain to me when I would need to wrap akka around the
Spark Streaming app. My knowledge of akka and the actor system is poor.
Please help!
Thanks
This is because currently GBTClassifier doesn't extend the
ClassificationModel abstract class, which in turn has the rawPredictionCol
and related methods for generating that column.
I'm actually not sure offhand whether this was because the GBT
implementation could not produce the raw prediction