Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-26 Thread Cassa L
No, I don't use YARN. This is the standalone Spark that comes with the DataStax Enterprise version of Cassandra. On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke wrote: > Do you use YARN? Then you need to configure the queues with the right > scheduler and method. > > On 27. Oct 2017, at 08:05, Cassa L w

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-26 Thread Jörn Franke
See also https://spark.apache.org/docs/latest/job-scheduling.html > On 27. Oct 2017, at 08:05, Cassa L wrote: > > Hi, > I have a spark job that has use case as below: > RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some > transformation and after that I do a count on transfo
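
A minimal sketch of the within-application scheduling the page above describes, assuming spark.scheduler.mode=FAIR is set on the SparkConf and using placeholder datasets rather than the Cassandra-backed RDDs from the question (spark here is the usual SparkSession). Each action submitted from its own thread becomes its own job, so the scheduler can run them concurrently instead of strictly one after the other:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Placeholders standing in for RDD1 and RDD2 read from Cassandra.
    val rdd1 = spark.range(0, 1000000).rdd
    val rdd2 = spark.range(0, 1000000).rdd

    // Each count() is a separate Spark job; submitting them from separate
    // threads lets the FAIR scheduler interleave them across the cluster.
    val f1 = Future { rdd1.count() }
    val f2 = Future { rdd2.count() }
    Await.result(f1, Duration.Inf)
    Await.result(f2, Duration.Inf)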

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-26 Thread Jörn Franke
Do you use YARN? Then you need to configure the queues with the right scheduler and method. > On 27. Oct 2017, at 08:05, Cassa L wrote: > > Hi, > I have a spark job that has use case as below: > RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some > transformation and after

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Piyush Mukati
Thanks, Michael. I have explored the Aggregator with update mode. The problem is that it gives the overall aggregated value for the changed groups, while I only want the delta change in the group as the aggre
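
For reference, a sketch (not from the thread) of the Aggregator-plus-update-mode approach being discussed; events is a hypothetical streaming Dataset[(String, Int)] and spark.implicits._ is assumed to be in scope. As noted above, update mode emits the new running total for each group that changed in the trigger, not a per-batch delta:

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}

    // Typed sum aggregator over (key, value) pairs.
    val sumAgg = new Aggregator[(String, Int), Int, Int] {
      def zero: Int = 0
      def reduce(buf: Int, in: (String, Int)): Int = buf + in._2
      def merge(b1: Int, b2: Int): Int = b1 + b2
      def finish(buf: Int): Int = buf
      def bufferEncoder: Encoder[Int] = Encoders.scalaInt
      def outputEncoder: Encoder[Int] = Encoders.scalaInt
    }

    val totals = events.groupByKey(_._1).agg(sumAgg.toColumn)

    totals.writeStream
      .outputMode("update")   // only groups updated in this trigger are written out
      .format("console")
      .start()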

RE: Anyone know how to build and run Spark on JDK 9?

2017-10-26 Thread Zhang, Liyun
Thanks for your suggestion; it seems that Scala 2.12.4 supports JDK 9. Scala 2.12.4 is now available. Our benchmarks show a furt

Structured streaming with event hubs

2017-10-26 Thread KhajaAsmath Mohammed
Hi, could anyone share a code snippet on how to use Spark Structured Streaming with Event Hubs? Thanks, Asmath Sent from my iPhone

Re: Anyone know how to build and run Spark on JDK 9?

2017-10-26 Thread Reynold Xin
It probably depends on the Scala version we use in Spark supporting Java 9 first. On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun wrote: > Hi all: > > 1. I want to build Spark on JDK 9 and test it with Hadoop on a JDK 9 > env. I searched for JIRAs related to JDK 9. I only found SPARK-13278 >

Anyone know how to build and run Spark on JDK 9?

2017-10-26 Thread Zhang, Liyun
Hi all: 1. I want to build Spark on JDK 9 and test it with Hadoop on a JDK 9 env. I searched for JIRAs related to JDK 9 and only found SPARK-13278. Does this mean Spark can now build and run successfully on JDK 9? Best Regards, Kelly Zhang/Zhang, Liyun

Re: Unexpected caching behavior

2017-10-26 Thread pnpritchard
I'm not sure why the example code didn't come through, so I'll try again: val x = spark.range(100) val y = x.map(_.toString) println(x.storageLevel) //StorageLevel(1 replicas) println(y.storageLevel) //StorageLevel(1 replicas) x.cache().foreachPartition(_ => ()) y.cache().foreachPartition(_ => (

Re: Unexpected caching behavior

2017-10-26 Thread pnpritchard
Not sure why the example code didn't come through, but here I'll try again: val x = spark.range(100) val y = x.map(_.toString) println(x.storageLevel) //StorageLevel(1 replicas) println(y.storageLevel) //StorageLevel(1 replicas) x.cache().foreachPartition(_ => ()) y.cache().foreachPartition(_ =>

Unexpected caching behavior

2017-10-26 Thread pnpritchard
I've noticed that when unpersisting an "upstream" Dataset, then the "downstream" Dataset is also unpersisted. I did not expect this behavior, and I've noticed that RDDs do not have this behavior. Below I've pasted a simple reproducible case. There are two datasets, x and y, where y is created by a
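
A sketch of the kind of reproduction described above (the lines past the truncation are my reconstruction, not the original post): both Datasets are cached and materialized, then unpersisting x drops y's cache as well.

    import spark.implicits._

    val x = spark.range(100)
    val y = x.map(_.toString)

    x.cache().foreachPartition(_ => ())   // materialize x's cache
    y.cache().foreachPartition(_ => ())   // materialize y's cache
    println(x.storageLevel)               // now reports a cached level (MEMORY_AND_DISK by default)
    println(y.storageLevel)               // now reports a cached level (MEMORY_AND_DISK by default)

    x.unpersist()
    println(y.storageLevel)               // back to StorageLevel(1 replicas): y was dropped too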

Re: Spark job blocked and runs indefinitely

2017-10-26 Thread Timur Shenkao
HBase has its own Java API and Scala API: you can use what you like. Btw, which Spark-HBase connector do you use? Cloudera or Hortonworks? On Wed, Oct 11, 2017 at 3:01 PM, Amine CHERIFI < cherifimohamedam...@gmail.com> wrote: > it seems that the job blocks when we call newAPIHadoopRDD to get data

How many jobs are left to calculate estimated time

2017-10-26 Thread Abdullah Bashir
Hi, I am running the SVD function on a 45 GB CSV with 8.9M rows. I have configured BLAS and ARPACK. My job has been running for 14.8 hours now, and the Spark UI on port 4040 shows: Cores: 160, Memory per Executor: 15.0 GB, State: RUNNING, Duration: 14.4 h. In the Jobs UI I am seeing Job Id ▾

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
OK, so for JDBC I presume it defaults to a single partition if you don't provide partitioning metadata? Thanks! Gary On 26 October 2017 at 13:43, Daniel Siegmann wrote: > Those settings apply when a shuffle happens. But they don't affect the way > the data will be partitioned when it is initi
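
A sketch of the JDBC partitioning options in question, with hypothetical connection details, table and column names; without partitionColumn/lowerBound/upperBound/numPartitions the read does come back as a single partition:

    // The four partitioning options below are what spread a JDBC read
    // across partitions; url, table, credentials and bounds are placeholders.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.events")
      .option("user", "spark")
      .option("password", "secret")
      .option("partitionColumn", "id")   // numeric column to range-partition on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "16")
      .load()

    println(df.rdd.getNumPartitions)     // 16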

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
Those settings apply when a shuffle happens. But they don't affect the way the data will be partitioned when it is initially read, for example spark.read.parquet("path/to/input"). So for HDFS / S3 I think it depends on how the data is split into chunks, but if there are lots of small chunks Spark w

Re: Suggestions on using scala/python for Spark Streaming

2017-10-26 Thread Sebastian Piu
Have a look at how PySpark works in conjunction with Spark, as it is not just a matter of language preference. There are several implications and a performance price to pay if you go with Python. At the end of the day only you can answer whether that price is worth paying versus retraining your team in anot

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
Thanks Daniel! I've been wondering that for ages! I.e., where my JDBC-sourced datasets are coming up with 200 partitions on write to S3. What do you mean by "(except for the initial read)"? Can you explain that a bit further? Gary Lucas On 26 October 2017 at 11:28, Daniel Siegmann wrote: > When

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
When working with datasets, Spark uses spark.sql.shuffle.partitions. It defaults to 200. Between that and the default parallelism you can control the number of partitions (except for the initial read). More info here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configurati
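
A small sketch of the two knobs being discussed (the input path and column name are placeholders): the initial partition count comes from the input splits, while anything after a shuffle uses spark.sql.shuffle.partitions.

    // Partitions produced by DataFrame/Dataset shuffles (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    val df = spark.read.parquet("path/to/input")
    println(df.rdd.getNumPartitions)        // driven by the input file splits, not the setting above

    val grouped = df.groupBy("someColumn").count()
    println(grouped.rdd.getNumPartitions)   // 50, because groupBy introduces a shuffle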

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Deepak Sharma
I guess the issue is that spark.default.parallelism is ignored when you are working with DataFrames. It applies only to raw RDDs. Thanks Deepak On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda < noo...@noorul.com> wrote: > Hi all, > > I have the following spark configurati

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
I think we'd need to see the code that loads the df. Parallelism and partition count are related but they're not the same. I've found the documentation fuzzy on this, but it looks like spark.default.parallelism is what Spark uses for partitioning when it has no other guidance. I'm also under the impr

Controlling number of spark partitions in dataframes

2017-10-26 Thread Noorul Islam Kamal Malmiyoda
Hi all, I have the following spark configuration:
spark.app.name=Test
spark.cassandra.connection.host=127.0.0.1
spark.cassandra.connection.keep_alive_ms=5000
spark.cassandra.connection.port=1
spark.cassandra.connection.timeout_ms=3
spark.cleaner.ttl=3600
spark.default.parallelism=4
spark.m

Re: Suggestions on using scala/python for Spark Streaming

2017-10-26 Thread lucas.g...@gmail.com
I don't have any specific wisdom for you on that front. But I've always been served well by the 'Try both' approach. Set up your benchmarks, configure both setups... You don't have to go the whole hog, but just enough to get a mostly realistic implementation functional. Run them both with some

Suggestions on using scala/python for Spark Streaming

2017-10-26 Thread umargeek
We are building a Spark Streaming application which is processing- and time-intensive. We are currently using the Python API but would like suggestions on whether to use Scala over Python, such as the pros and cons, as we are planning a production setup as the next step. Thanks, Umar -- Sent from: http://a

Re: Structured Stream in Spark

2017-10-26 Thread KhajaAsmath Mohammed
Thanks TD. On Wed, Oct 25, 2017 at 6:42 PM, Tathagata Das wrote: > Please do not confuse old Spark Streaming (DStreams) with Structured > Streaming. Structured Streaming's offset and checkpoint management is far > more robust than DStreams. > Take a look at my talk - https://spark-summit.org/ >
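
A minimal sketch of what that offset/checkpoint management looks like in code; the broker, topic, and paths here are placeholders:

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .writeStream
      .format("parquet")
      .option("path", "/data/events")
      .option("checkpointLocation", "/checkpoints/events")  // offsets and state are tracked here for recovery
      .start()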

Re: What is the equivalent of foreachRDD in DataFrames?

2017-10-26 Thread Deepak Sharma
df.rdd.foreach Thanks Deepak On Oct 26, 2017 18:07, "Noorul Islam Kamal Malmiyoda" wrote: > Hi all, > > I have a Dataframe with 1000 records. I want to split them into 100 > each and post to rest API. > > If it was RDD, I could use something like this > > myRDD.foreachRDD { > rdd => >
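
A sketch of the batching pattern being discussed, going through df.rdd as above (df being the DataFrame from the question); postToRestApi is a hypothetical stand-in for whatever HTTP client call is used.

    import org.apache.spark.sql.Row

    def postToRestApi(batch: Seq[Row]): Unit = ???   // placeholder for the actual REST call

    df.rdd.foreachPartition { partition =>
      // grouped(100) walks the partition's iterator in chunks of up to 100 rows
      partition.grouped(100).foreach(batch => postToRestApi(batch))
    }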

Re: What is the equivalent of foreachRDD in DataFrames?

2017-10-26 Thread Jean Georges Perrin
Just hints: Repartition into 10? Get the RDD from the DataFrame? What about a foreach over rows and sending every 100? (I just did that, actually) jg > On Oct 26, 2017, at 13:37, Noorul Islam Kamal Malmiyoda > wrote: > > Hi all, > > I have a Dataframe with 1000 records. I want to split them into 100 >

What is the equivalent of foreachRDD in DataFrames?

2017-10-26 Thread Noorul Islam Kamal Malmiyoda
Hi all, I have a DataFrame with 1000 records. I want to split them into batches of 100 each and post them to a REST API. If it were an RDD, I could use something like this: myRDD.foreachRDD { rdd => rdd.foreachPartition { partition => { This will ensure that the code is executed on executors a

Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-26 Thread Prashant Sharma
Hi Darshan, Did you try passing the config directly as an option, like this: .option("kafka.sasl.jaas.config", saslConfig) Where saslConfig can look like: com.sun.security.auth.module.Krb5LoginModule required \ useKeyTab=true \ storeKey=true \ keyTab="/etc/security/key
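
A sketch of that suggestion wired into a Structured Streaming read; the keytab path, principal, brokers and topic are placeholders:

    // JAAS config passed per-source instead of via a JVM-wide jaas.conf file.
    val saslConfig = "com.sun.security.auth.module.Krb5LoginModule required " +
      "useKeyTab=true storeKey=true " +
      "keyTab=\"/etc/security/keytabs/spark.keytab\" " +
      "principal=\"spark@EXAMPLE.COM\";"

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")
      .option("subscribe", "secure-topic")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.kerberos.service.name", "kafka")
      .option("kafka.sasl.jaas.config", saslConfig)
      .load()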

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
- dev I think you should be able to write an Aggregator. You probably want to run in update mode if you are looking for it to output any group that has changed in the batch. On Wed, Oct 25, 201

Re: text processing in Spark (Spark job gets stuck for several minutes)

2017-10-26 Thread Jörn Franke
Please provide the source code and any exceptions that are in the executor and/or driver log. > On 26. Oct 2017, at 08:42, Donni Khan wrote: > > Hi, > I'm applying preprocessing methods to a large text dataset using Spark with Java. I > created my own NLP pipeline as normal Java code and call it in the map >
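
Not from the thread, but one pattern worth checking while waiting for that code: building the NLP pipeline once per partition rather than once per record, since per-record initialization of a heavy Java pipeline is a common cause of long stalls. documents and MyNlpPipeline below are placeholders for the poster's data and Java class.

    val processed = documents.mapPartitions { docs =>
      val pipeline = new MyNlpPipeline()        // construct the heavy pipeline once per partition
      docs.map(doc => pipeline.process(doc))    // reuse it for every record in the partition
    }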