Re: Spark S3

2016-10-10 Thread Selvam Raman
I mentioned Parquet as the input format. On Oct 10, 2016 11:06 PM, "ayan guha" wrote: > It really depends on the input format used. > On 11 Oct 2016 08:46, "Selvam Raman" wrote: > >> Hi, >> >> How does Spark read data from S3 and run parallel tasks? >> >> Assume I

GraphFrame BFS

2016-10-10 Thread cashinpj
Hello, I have a set of data representing various network connections. Each vertex is represented by a single id, while the edges have a source id, destination id, and a relationship (peer to peer, customer to provider, or provider to customer). I am trying to create a subgraph built around a
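A minimal sketch (not from the thread) of building a GraphFrame with that vertex/edge shape, filtering edges by relationship into a subgraph, and running BFS. The column names, relationship values, and vertex ids are assumptions; it requires the graphframes package on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("bfs-sketch").getOrCreate()
import spark.implicits._

// Hypothetical vertices and edges matching the description above
val vertices = Seq("a", "b", "c").toDF("id")
val edges = Seq(
  ("a", "b", "peer to peer"),
  ("b", "c", "customer to provider")
).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Subgraph containing only peer-to-peer links
val sub = GraphFrame(vertices, g.edges.filter($"relationship" === "peer to peer"))

// BFS from vertex "a" to vertex "c", at most 3 hops
val paths = g.bfs.fromExpr("id = 'a'").toExpr("id = 'c'").maxPathLength(3).run()
paths.show()
```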

Kryo on Zeppelin

2016-10-10 Thread Fei Hu
Hi All, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer. And in the dependencies I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3 When I
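For reference, a minimal sketch (a standalone app rather than the Zeppelin interpreter setup in the thread) of enabling Kryo serialization. Note that Spark 1.5.x pulls in an older Kryo (2.2x, via Twitter chill), so adding kryo 3.0.3 as an extra dependency can conflict with it; the class names below are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Point(x: Double, y: Double)

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional but recommended: register the classes that will be serialized
  .registerKryoClasses(Array(classOf[Point]))

val sc = new SparkContext(conf)
sc.parallelize(Seq(Point(1, 2), Point(3, 4))).map(p => p.x + p.y).collect()
```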

[no subject]

2016-10-10 Thread Fei Hu
Hi All, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer. And in the dependencies I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3 When I

Re: [Spark] RDDs are not persisting in memory

2016-10-10 Thread Chin Wei Low
Hi, Your RDD is 5GB; perhaps it is too large to fit into the executors' storage memory. You can refer to the Executors tab in the Spark UI to check the available storage memory for each executor. Regards, Chin Wei On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports [{...}, {...} ...] Or {...} format as input. On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote: > Thanks Luciano - I think this is my issue :( > > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at >

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
Repartition almost always involves a shuffle. Let me see if I can explain the recovery stuff... Say you start with two Kafka partitions, topic-0 and topic-1. You shuffle those across 3 Spark partitions, we'll label them A, B and C. Your job has written fileA: results for A, offset ranges

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
Just thought of another potential issue: you should use the "provided" scope when depending on Spark. I.e., in your project's pom: org.apache.spark spark-core_2.11 2.0.1 provided On Mon, Oct 10, 2016 at 2:00 PM, Jakob Odersky
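For projects built with sbt instead of Maven, a sketch of the equivalent setting (assuming the same artifact and version as above): Spark is supplied by the cluster at runtime, so it should not be bundled into the application jar.

```scala
// build.sbt: mark Spark as "provided" so it is available at compile time
// but excluded from the assembled application jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"
```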

Design consideration for a trading System

2016-10-10 Thread Mich Talebzadeh
Hi, I have been working on some Lambda Architecture for trading systems. I think I have completed the dry runs for testing the modules. For batch layer the criteria is a day's lag (one day old data). This is acceptable for the users who come from BI background using Tableau but I think we can

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
On 10/10/2016 8:14 PM, Cody Koeninger wrote: Glad it was helpful :) As far as executors, my expectation is that if you have multiple executors running, and one of them crashes, the failed task will be submitted on a different executor. That is typically what I observe in spark apps, if

[Spark] RDDs are not persisting in memory

2016-10-10 Thread diplomatic Guru
Hello team, Spark version: 1.6.0 I'm trying to persist some data in memory for reuse. However, when I call rdd.cache() OR rdd.persist(StorageLevel.MEMORY_ONLY()) it does not store the data, as I cannot see any RDD information under the WebUI (Storage tab). Therefore I tried
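A small sketch (not from the thread) of the usual gotcha here: cache()/persist() are lazy, so nothing shows up under the Storage tab until an action actually materializes the RDD. The input path is hypothetical and `sc` is the spark-shell SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///some/path")   // hypothetical path
  .map(_.length)
  .persist(StorageLevel.MEMORY_ONLY)

rdd.count()   // first action computes and caches the partitions that fit in memory
rdd.count()   // later actions read from the cache (now visible in the Storage tab)
```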

Re: Spark S3

2016-10-10 Thread ayan guha
It really depends on the input format used. On 11 Oct 2016 08:46, "Selvam Raman" wrote: > Hi, > > How does Spark read data from S3 and run parallel tasks? > > Assume I have an S3 bucket of 35 GB (Parquet files). > > How will the SparkSession read the data and process the data

Re: Spark Streaming Advice

2016-10-10 Thread Jörn Franke
Your file size is too small; this has a significant impact on the NameNode. Use HBase or maybe HAWQ to store small writes. > On 10 Oct 2016, at 16:25, Kevin Mellott wrote: > > Whilst working on this application, I found a setting that drastically > improved the

Spark S3

2016-10-10 Thread Selvam Raman
Hi, How does Spark read data from S3 and run parallel tasks? Assume I have an S3 bucket of 35 GB (Parquet files). How will the SparkSession read the data and process it in parallel? How does it split the S3 data and assign it to each executor task? Please share your thoughts. Note: if we have
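A sketch (not from the thread) of the basic read path being asked about, assuming a spark-shell `spark` session and a hypothetical bucket and column. Parquet is splittable, so Spark plans one task per file split and the scheduler distributes those tasks across the executors.

```scala
// s3a:// requires the hadoop-aws jars and S3 credentials to be configured
val df = spark.read.parquet("s3a://my-bucket/path/to/data/")  // hypothetical bucket

// Number of read tasks Spark planned for the 35 GB of Parquet files
println(df.rdd.getNumPartitions)

df.groupBy("some_column").count().show()  // hypothetical column
```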

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
How do you submit the application? A version mismatch between the launcher, driver and workers could lead to the bug you're seeing. A common reason for a mismatch is that the SPARK_HOME environment variable is set. This will cause the spark-submit script to use the launcher determined by that

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
The batch interval was set to 30 seconds; however, after getting the parquet files to save faster I lowered the interval to 10 seconds. The number of log messages contained in each batch varied from just a few up to around 3500, with the number of partitions ranging from 1 to around 15. I will

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Thanks Luciano - I think this is my issue :( > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets >

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread Cody Koeninger
This should give you hints on the necessary cast: http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#tab_java_2 The main ugly thing there is that the Java RDD is wrapping the Scala RDD, so you need to unwrap one layer via rdd.rdd(). If anyone wants to work on a PR to update
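A Scala sketch (not the Java version asked about in the thread) of manual commits with the 0-10 integration; in Java the JavaRDD wraps the Scala RDD, hence the note above about unwrapping with rdd.rdd() before the cast. Broker, group id, and topic are hypothetical.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("manual-commit-sketch")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // hypothetical broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                       // hypothetical group
  "enable.auto.commit" -> (false: java.lang.Boolean)   // commit manually instead
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  // the underlying RDD carries the offset ranges for this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here ...
  // commit back to Kafka only after processing has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
```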

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Thanks! I am ok with strict rules (despite being French), but even: [{ "red": "#f00", "green": "#0f0" },{ "red": "#f01", "green": "#0f1" }] is not going through… Is there a way to see what it does not like? The JSON parser has been pretty good to me until

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
Glad it was helpful :) As far as executors, my expectation is that if you have multiple executors running, and one of them crashes, the failed task will be submitted on a different executor. That is typically what I observe in spark apps, if that's not what you're seeing I'd try to get help on

Re: JSON Arrays and Spark

2016-10-10 Thread Luciano Resende
Please take a look at http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Particularly the note about the required format: Note that the file that is offered as *a json file* is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
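A small sketch (not from the thread) contrasting the two inputs, assuming a spark-shell `spark` session and hypothetical paths: the reader expects each line to be self-contained JSON (a single object, or an array kept on one line), so a pretty-printed top-level array spread over multiple lines comes back as `_corrupt_record` rows.

```scala
// works: one self-contained JSON value per line, e.g.
//   {"red": "#f00", "green": "#0f0"}
//   {"red": "#f01", "green": "#0f1"}
val ok = spark.read.json("colors.jsonl")    // hypothetical path
ok.printSchema()

// does not work as expected: a single array pretty-printed over several lines, e.g.
//   [
//     {"red": "#f00", "green": "#0f0"},
//     {"red": "#f01", "green": "#0f1"}
//   ]
val bad = spark.read.json("colors_array.json")  // yields _corrupt_record rows
bad.printSchema()
```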

converting hBaseRDD to DataFrame

2016-10-10 Thread Mich Talebzadeh
Hi, I am trying to do some operations on an HBase table that is being populated by Spark Streaming. Now this is just Spark on HBase, as opposed to Spark on Hive -> view on HBase etc. I also have a Phoenix view on this HBase table. This is sample code scala> val tableName =
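A hedged sketch (not the code from the thread) of one common way to turn an HBase scan into a DataFrame via newAPIHadoopRDD; the table name, column family, and qualifier are hypothetical, and `sc`/`spark` are the spark-shell context and session.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val tableName = "trades"                      // hypothetical table
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName)

// One Spark partition per HBase region
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Extract the row key and one column, then convert to a DataFrame
val rows = hBaseRDD.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get())
  val price  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("price")))
  (rowKey, price)
}

import spark.implicits._
val df = rows.toDF("rowkey", "price")
df.show()
```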

Large variation in spark in Task Deserialization Time

2016-10-10 Thread Pulasthi Supun Wickramasinghe
Hi All, I am seeing a huge variation in Spark Task Deserialization Time for my collect and reduce operations. While most tasks complete within 100ms, a few take more than a couple of seconds, which slows the entire program down. I have attached a screen shot of the web UI where you can see the

Error: PartitioningCollection requires all of its partitionings have the same numPartitions.

2016-10-10 Thread cuevasclemente
Hello, I am having some interesting issues with a consistent error in Spark that occurs when I'm working with DataFrames that are the result of some amount of joining and other transformations. PartitioningCollection requires all of its partitionings have the same numPartitions. It seems
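A hedged workaround sketch (not from the thread, and not guaranteed to apply to this case): when a join of DataFrames derived from the same source hits this error, explicitly aligning the partitioning of both sides (or materializing one of them) before the join is often reported to avoid it. Everything below is hypothetical and assumes a spark-shell `spark` session.

```scala
import spark.implicits._

val base  = spark.range(0, 1000).toDF("id")
// Give both sides the same explicit output partitioning before joining
val left  = base.filter($"id" % 2 === 0).repartition(200, $"id")
val right = base.filter($"id" % 3 === 0).repartition(200, $"id")

val joined = left.join(right, "id")
joined.count()
```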

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
I just noticed that you're the author of the code I linked in my previous email. :) It's helpful. When using `foreachPartition` or `mapPartitions`, I noticed I can't ask Spark to write the data to disk using `df.write()`; I need to use the iterator to do so, which means losing the
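A sketch (not from the thread) of the pattern being described: inside foreachPartition the code runs on the executor with a plain row iterator, so writes go through whatever client or file handle is opened there rather than through df.write(). The DataFrame and output path are hypothetical.

```scala
import java.io.PrintWriter
import org.apache.spark.TaskContext

val df = spark.range(0, 100).toDF("id")   // stand-in DataFrame

df.rdd.foreachPartition { iter =>
  val partId = TaskContext.get.partitionId
  val out = new PrintWriter(s"/tmp/output-part-$partId.txt")  // hypothetical sink on the executor
  try {
    iter.foreach(row => out.println(row.mkString(",")))       // consume the iterator ourselves
  } finally {
    out.close()
  }
}
```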

Re: Inserting New Primary Keys

2016-10-10 Thread Benjamin Kim
Jean, I see your point. For the incremental data, which is very small, I should make sure that the PARTITION BY in the OVER(PARTITION BY ...) is left out so that all the data will be in one partition when assigned a row number. The query below should avoid any problems. “SELECT ROW_NUMBER()
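A hedged sketch of the same idea using zipWithIndex rather than the SQL ROW_NUMBER() approach quoted above (named explicitly as an alternative, not the poster's query): offset a 0-based index by the current maximum id so the small incremental batch gets unique, increasing keys. Table, column names, and the max id are hypothetical.

```scala
import spark.implicits._

val existingMaxId = 1000L                          // e.g. the result of SELECT MAX(id) FROM target
val incoming = Seq("a", "b", "c").toDF("value")    // small incremental batch

val withKeys = incoming.rdd
  .zipWithIndex()
  .map { case (row, idx) => (existingMaxId + idx + 1, row.getString(0)) }
  .toDF("id", "value")

withKeys.show()
```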

JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Hi folks, I am trying to parse JSON arrays and it’s getting a little crazy (for me at least)… 1) If my JSON is: {"vals":[100,500,600,700,800,200,900,300]} I get: ++ |vals| ++ |[100, 500, 600, 7...| ++ root |-- vals:

Re: Map with state keys serialization

2016-10-10 Thread Shixiong(Ryan) Zhu
That's enough. Did you see any error? On Mon, Oct 10, 2016 at 5:08 AM, Joey Echeverria wrote: > Hi Ryan! > > Do you know where I need to configure Kryo for this? I already have > spark.serializer=org.apache.spark.serializer.KryoSerializer in my > SparkConf and I registered the

Re: Spark Streaming Advice

2016-10-10 Thread Mich Talebzadeh
Hi Kevin, What is the streaming interval (batch interval) above? I do analytics on streaming trade data, but after manipulation of the individual messages I store the selected ones in HBase. Very fast. HTH Dr Mich Talebzadeh LinkedIn *

Re: Logistic Regression Standardization in ML

2016-10-10 Thread Yanbo Liang
AFAIK, we can guarantee that with or without standardization, the models always converge to the same solution if there is no regularization. You can refer to the test cases at:
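A brief sketch of the reasoning behind that claim (not from the thread): standardizing feature j by its standard deviation is an invertible reparameterization, so the unregularized maximum-likelihood solution in the scaled space maps back to the same model; with an L1/L2 penalty the objective is not invariant to the rescaling, so the solutions generally differ.

```latex
x'_j = \frac{x_j}{\sigma_j}, \qquad
\hat{\beta}' = \arg\max_{\beta'} \, \ell(\beta' ; x')
\;\Longrightarrow\;
\hat{\beta}_j = \frac{\hat{\beta}'_j}{\sigma_j}
\quad\text{since}\quad
\beta'_j \, x'_j = \frac{\beta'_j}{\sigma_j}\, x_j .
```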

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-10 Thread Matthias Niehoff
Yes, without committing the data the consumer rebalances. The job consumes 3 streams and processes them. When consuming only one stream it runs fine. But when consuming three streams, even without joining them, just deserializing the payload and triggering an output action, it fails. I will prepare code

Re: Inserting New Primary Keys

2016-10-10 Thread Jean Georges Perrin
Is there only one process adding rows? Because this seems a little risky if you have multiple threads doing that… > On Oct 8, 2016, at 1:43 PM, Benjamin Kim wrote: > > Mich, > > After much searching, I found and am trying to use “SELECT ROW_NUMBER() > OVER() + b.id_max

Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread static-max
Hi, by following this article I managed to consume messages from Kafka 0.10 in Spark 2.0: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html However, the Java examples are missing and I would like to commit the offset myself after processing the RDD. Does anybody have a

Re: spark using two different versions of netty?

2016-10-10 Thread Paweł Szulc
Yeah, I should be more precise. Those are two direct dependencies. On Mon, Oct 10, 2016 at 1:15 PM, Sean Owen wrote: > Usually this sort of thing happens because the two versions are in > different namespaces in different major versions and both are needed. That > is true of

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
What is it you're actually trying to accomplish? On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote: > I managed to make a specific executor crash by using > TaskContext.get.partitionId and throwing an exception for a specific > executor. > > The issue I have now is that the

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
Whilst working on this application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation. As my program ingested information into HDFS (as parquet files), I

Logistic Regression Standardization in ML

2016-10-10 Thread Cesar
I have a question regarding how the default standardization in the ML version of Logistic Regression (Spark 1.6) works, specifically about the following comments in the Spark code: /** * Whether to standardize the training features before fitting the model. * The coefficients of models will be