Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-11-28 Thread dav009
I am calling spark-submit passing maxRate; I have a single Kinesis receiver and batches of 1s: spark-submit --conf spark.streaming.receiver.maxRate=10. However, a single batch can greatly exceed the established maxRate, e.g. I'm getting 300 records. Am I missing any setting? -- View this
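A hedged sketch of settings worth trying alongside maxRate (the --conf flags are standard Spark Streaming configs; the jar name is a placeholder): enabling backpressure lets Spark throttle the receiver dynamically, and initialRate caps the earliest batches before the rate estimator has data.

```shell
spark-submit \
  --conf spark.streaming.receiver.maxRate=10 \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.initialRate=10 \
  my-streaming-app.jar
```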

Re: Bit-wise AND operation between integers

2016-11-28 Thread Reynold Xin
Bcc dev@ and add user@. The dev list is not meant for users to ask questions on how to use Spark. For that you should use Stack Overflow or the user@ list.

scala> sql("select 1 & 2").show()
+-------+
|(1 & 2)|
+-------+
|      0|
+-------+

scala> sql("select 1 & 3").show()
+-------+
|(1 & 3)|
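For completeness, the same operator is also exposed on the DataFrame API as Column.bitwiseAND; a minimal sketch, assuming a SparkSession named `spark`:

```scala
import org.apache.spark.sql.functions.lit

val df = spark.range(1).select(
  lit(1).bitwiseAND(lit(2)).as("one_and_two"),   // 1 & 2 = 0
  lit(1).bitwiseAND(lit(3)).as("one_and_three")  // 1 & 3 = 1
)
df.show()
```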

Re: Spark ignoring partition names without equals (=) separator

2016-11-28 Thread Prasanna Santhanam
On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran wrote: > > irrespective of naming, know that deep directory trees are performance > killers when listing files on s3 and setting up jobs. You might actually be > better off having them in the same directory and using a

groupbykey data access size vs Reducer number

2016-11-28 Thread memoryzpp
Hi all, how does shuffle in Spark 1.6.2 work? I am using groupbykey(int: partitionSize). groupbykey, a shuffle operation, has a mapper side (M mappers) and a reducer side (R reducers). Here R=partitionSize, and each mapper will produce a local file output and store it in spark.local.dir. Let's assume total

null values returned by max() over a window function

2016-11-28 Thread Han-Cheol Cho
Hello, I am trying to test Spark's SQL window functions following this blog post, https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html, and am facing a problem as follows: # testing rowsBetween() winSpec2 =
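A minimal sketch of a bounded row frame, assuming a DataFrame `df` with "category" and "revenue" columns as in the linked blog post; before Spark 2.1 the unbounded-preceding and current-row frame markers are written as Long.MinValue and 0L:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// Frame covering every row from the start of the partition to the current row.
val winSpec2 = Window
  .partitionBy("category")
  .orderBy("revenue")
  .rowsBetween(Long.MinValue, 0)

val result = df.withColumn("running_max", max(df("revenue")).over(winSpec2))
```

One common source of nulls here: a frame that ends before the current row, e.g. rowsBetween(Long.MinValue, -1), contains no rows at all for the first row of each partition, so max() over it returns null.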

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-28 Thread Michael Armbrust
You could open up a JIRA to add a version of from_json that supports schema inference, but unfortunately that would not be super easy to implement. In particular, it would introduce a weird case where only this specific function would block for a long time while we infer the schema (instead of

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
Hello All, I just want to make sure this is a right use case for Kafka --> Spark Streaming Few words about my use case : When the user watches a video, I get the position events from the user that indicates how much they have completed viewing and at a certain point, I mark that Video as

Re: How to disable write ahead logs?

2016-11-28 Thread Takeshi Yamamuro
Hi, to disable the tracker-side WAL, you unset the checkpoint dir that is set via streamingContext.checkpoint(). http://spark.apache.org/docs/latest/streaming-programming-guide.html#how-to-configure-checkpointing // maropu On Tue, Nov 29, 2016 at 9:04 AM, Tim Harsch wrote: > Hi all,

How to disable write ahead logs?

2016-11-28 Thread Tim Harsch
Hi all, I set `spark.streaming.receiver.writeAheadLog.enable=false` and my history server confirms the property has been set. Yet, I continue to see the error: 16/11/28 15:47:04 ERROR util.FileBasedWriteAheadLog_ReceivedBlockTracker: Failed to write to write ahead log after 3 failures I

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
In this case, persisting to Cassandra is for future analytics and visualization. I want to notify the app of the event, so it makes the app interactive. Thanks On Mon, Nov 28, 2016 at 2:24 PM, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > Sorry I don't understand... > Is

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread vincent gromakowski
Sorry, I don't understand... Is it a Cassandra acknowledgement to the actors that you want? Why do you want to ack after writing to Cassandra? Your pipeline kafka=>spark=>cassandra is supposed to be exactly-once, so you don't need to wait for the Cassandra ack; you can just write to Kafka from the actors and

What do I set rolling log to avoid filling up the disk?

2016-11-28 Thread kant kodali
Hi All, files like the one below are quickly filling up the disk. I am using a standalone cluster, so what setting do I need to change to turn this into a rolling log or otherwise avoid filling up the disk? spark/work/app-20161128185548/1/stderr Thanks, kant
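Spark's standalone worker can roll executor stderr/stdout itself; a sketch of the documented rolling properties (the values are illustrative), which take effect on the workers:

```
# Set on each worker, e.g. via SPARK_WORKER_OPTS or spark-defaults.conf
spark.executor.logs.rolling.strategy           time
spark.executor.logs.rolling.time.interval      daily
spark.executor.logs.rolling.maxRetainedFiles   7
```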

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
Thanks Vincent for the input. Not sure I understand your suggestion. Please clarify. Few words about my use case : When the user watches a video, I get the position events from the user that indicates how much they have completed viewing and at a certain point, I mark that Video as complete and

Re:

2016-11-28 Thread Marco Mistroni
Uhm, this link https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations seems to indicate you can do it. hth On Mon, Nov 28, 2016 at 9:55 PM, Didac Gil wrote: > Any suggestions for using something like OneHotEncoder and

[no subject]

2016-11-28 Thread Didac Gil
Any suggestions for using something like OneHotEncoder and StringIndexer on an InputDStream? I could try to combine an Indexer based on a static parquet but I want to use the OneHotEncoder approach in Streaming data coming from a socket. Thanks! Dídac Gil de la Iglesia
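One hedged approach: fit the StringIndexer once on static data and apply the resulting model to each micro-batch. The variable names (`staticDF`, `inputDStream`) and the toDF conversion are assumptions, not part of the original question.

```scala
import org.apache.spark.ml.feature.StringIndexer

// Fit once, up front, on a static DataFrame (e.g. loaded from a parquet file).
val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(staticDF)

// Apply per micro-batch; requires `import spark.implicits._` for toDF.
inputDStream.foreachRDD { rdd =>
  val batchDF = rdd.toDF("category")
  val indexed = indexerModel.transform(batchDF)
  // feed `indexed` to a OneHotEncoder, model scoring, etc.
}
```

Note that categories unseen at fit time make transform() error by default; StringIndexer's handleInvalid parameter can be set to skip them instead.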

Re: createDataFrame causing a strange error.

2016-11-28 Thread Marco Mistroni
Hi Andrew, sorry, but to me it seems s3 is the culprit. I have downloaded your json file and stored it locally. Then I wrote this simple app (a subset of what you have in your GitHub; sorry, I'm a little bit rusty on how to create a new column out of existing ones) which basically reads the json file. It's in

Re: Unsubscribe

2016-11-28 Thread Charles Allen
Can we get a bot that auto-subscribes folks to cat-facts! when they email the user list instead of user-unsubscr...@spark.apache.org ? On Mon, Nov 28, 2016 at 1:28 PM R. Revert wrote: > Unsubscribe > > El 28 nov. 2016 5:22 p. m., escribió: > >

Unsubscribe

2016-11-28 Thread R. Revert
Unsubscribe El 28 nov. 2016 5:22 p. m., escribió: > Unsubscribe > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >

Unsubscribe

2016-11-28 Thread ryou . hasegawa
Unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread vincent gromakowski
You don't need actors to do kafka=>spark processing=>kafka. Why do you need to notify the akka producer? If you need to get back the processed message in your producer, then implement an akka consumer in your akka app and the kafka offsets will do the job 2016-11-28 21:46 GMT+01:00 shyla deshpande

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
Thanks Daniel for the response. I am planning to use Spark Streaming to do event processing. I will have akka actors sending messages to Kafka. I process them using Spark Streaming, and as a result new events will be generated. How do I notify the akka actor (message producer) that a new event

Re: time to run Spark SQL query

2016-11-28 Thread ayan guha
They should take the same time if everything else is constant On 28 Nov 2016 23:41, "Hitesh Goyal" wrote: > Hi team, I am using spark SQL for accessing the amazon S3 bucket data. > > If I run a sql query by using normal SQL syntax like below > > 1) DataFrame
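The reason the two forms should match: both go through the same Catalyst optimizer and compile to the same physical plan. A sketch (the table name and predicate are placeholders, assuming a 1.x-style `sqlContext`):

```scala
// Same query expressed two ways.
val viaSql = sqlContext.sql("SELECT * FROM tablename WHERE col > 10")
val viaApi = sqlContext.table("tablename").filter("col > 10")

// Comparing the plans should show they are equivalent.
viaSql.explain()
viaApi.explain()
```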

Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-11-28 Thread Matthew Dailey
I just stumbled upon this issue as well in Spark 1.6.2 when trying to write my own custom Sink. For anyone else who runs into this issue, there are two relevant JIRAs that I found, but no solution as of yet: - https://issues.apache.org/jira/browse/SPARK-14151 - Propose to refactor and expose

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread Daniel van der Ende
Well, I would say it depends on what you're trying to achieve. Right now I don't know why you are considering using Akka. Could you please explain your use case a bit? In general, there is no single correct answer to your current question as it's quite broad. Daniel On Mon, Nov 28, 2016 at 9:11

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
Anyone with experience of spark streaming in production, appreciate your input. Thanks -shyla On Mon, Nov 28, 2016 at 12:11 AM, shyla deshpande wrote: > My data pipeline is Kafka --> Spark Streaming --> Cassandra. > > Can someone please explain me when would I need to

Re: SparkILoop doesn't run

2016-11-28 Thread Mohit Jaggi
Thanks, I will look into the classpaths and check. On Mon, Nov 21, 2016 at 3:28 PM, Jakob Odersky wrote: > The issue I was having had to do with missing classpath settings; in > sbt it can be solved by setting `fork:=true` to run tests in new jvms > with appropriate

Unsubscribe

2016-11-28 Thread Ninad Shringarpure
Unsubscribe

Re: if conditions

2016-11-28 Thread Stuart White
Would you please provide a simple code snippet demonstrating the problem and also the error message you're receiving? On Mon, Nov 28, 2016 at 12:12 AM, Hitesh Goyal wrote: > I tried this, but it is throwing an error that the method "when" is not > applicable. > I am

Re: Dataframe broadcast join hint not working

2016-11-28 Thread Yong Zhang
If your query plan has "Project" in it, there is a bug in Spark preventing the "broadcast" hint from working in pre-2.0 releases. https://issues.apache.org/jira/browse/SPARK-13383 Unfortunately, the fix was not backported to 1.x. Yong From: Anton Okolnychyi

Re: createDataFrame causing a strange error.

2016-11-28 Thread Andrew Holway
I extracted out the boto bits and tested in vanilla python on the nodes. I am pretty sure that the data from S3 is ok. I've applied a public policy to the bucket s3://time-waits-for-no-man. There is a publicly available object here:

Re: Spark app write too many small parquet files

2016-11-28 Thread Chin Wei Low
Try limiting the partitions: spark.sql.shuffle.partitions controls the number of files generated. On 28 Nov 2016 8:29 p.m., "Kevin Tran" wrote: > Hi Denny, > Thank you for your inputs. I also use 128 MB but still too many files > generated by Spark app which is only ~14 KB
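Alternatively, reducing partitions just before the write also caps the file count; a sketch with assumed paths and counts:

```scala
// Either lower the shuffle parallelism globally...
sqlContext.setConf("spark.sql.shuffle.partitions", "8")

// ...or collapse partitions for this one write only.
df.coalesce(8)
  .write
  .parquet("s3a://bucket/output/")
```

coalesce avoids a full shuffle, but note it concentrates the write work onto fewer tasks.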

How to use logback

2016-11-28 Thread Erwan ALLAIN
Hello, in my project I would like to use logback as the logging framework (faster, smaller memory footprint, etc.). I have managed to make it work; however, I had to modify the spark jars folder - remove slf4j-log4jxx.jar - add logback-classic / logback-core.jar And add logback.xml in the conf folder. Is it
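For reference, a minimal logback.xml along the lines described (the appender layout and pattern are illustrative, not from the original message):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```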

time to run Spark SQL query

2016-11-28 Thread Hitesh Goyal
Hi team, I am using spark SQL for accessing the amazon S3 bucket data. If I run a sql query by using normal SQL syntax like below 1) DataFrame d=sqlContext.sql(i.e. Select * from tablename where column_condition); Secondly, if I use dataframe functions for the same query like below :- 2)

Re: Spark app write too many small parquet files

2016-11-28 Thread Kevin Tran
Hi Denny, thank you for your inputs. I also use 128 MB, but still too many files are generated by the Spark app, each only ~14 KB! That's why I'm asking if there is a solution, in case someone has the same issue. Cheers, Kevin. On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee

Re: Spark ignoring partition names without equals (=) separator

2016-11-28 Thread Steve Loughran
irrespective of naming, know that deep directory trees are performance killers when listing files on s3 and setting up jobs. You might actually be better off having them in the same directory and using a pattern like 2016-03-11-* as the pattern to find files. On 28 Nov 2016, at 04:18,
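A sketch of the flat-layout read (the bucket and path are placeholders):

```scala
// One directory of date-prefixed files: a single LIST call
// on the glob instead of a recursive tree walk.
val df = sqlContext.read.json("s3a://bucket/events/2016-03-11-*")
```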

Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-28 Thread rohit13k
Found the exact issue. If the vertex attribute is a complex object containing mutable objects, the edge triplet does not pick up the new state once the vertex attributes have already been shipped, but if the vertex attributes are immutable objects then there is no issue. Below is code demonstrating the same. Just

Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
My data pipeline is Kafka --> Spark Streaming --> Cassandra. Can someone please explain me when would I need to wrap akka around the spark streaming app. My knowledge of akka and the actor system is poor. Please help! Thanks

Re: how to print auc & prc for GBTClassifier, which is okay for RandomForestClassifier

2016-11-28 Thread Nick Pentreath
This is because currently GBTClassifier doesn't extend the ClassificationModel abstract class, which in turn has the rawPredictionCol and related methods for generating that column. I'm actually not sure off hand whether this was because the GBT implementation could not produce the raw prediction