Re: odd caching behavior or accounting

2017-02-09 Thread Hbf
I'm seeing the same behavior in Spark 2.0.1. Does anybody have an explanation? Thanks! Kaspar

bmiller1 wrote:
> Hi All,
>
> I've recently noticed some caching behavior which I did not understand
> and may or may not have indicated a bug. In short, the web UI seemed
> to indicate that some

Add hive-site.xml at runtime

2017-02-09 Thread Shivam Sharma
Hi, I have multiple Hive configurations (hive-site.xml files), and because of that I cannot place any single Hive configuration in the Spark *conf* directory. I want to supply the configuration file at the start of each *spark-submit* or *spark-shell* run. The conf file is huge, so *--conf* is not an option for me.
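One commonly suggested route is to ship the file per run with spark-submit's --files option (which copies it into each container's working directory); another is to set the relevant Hive properties programmatically. A minimal sketch of the latter, assuming Spark 2.x and a placeholder metastore URI:

    import org.apache.spark.sql.SparkSession

    // Sketch only: choose the Hive settings for this run in code
    // instead of dropping a hive-site.xml into conf/.
    // The metastore URI below is a placeholder, not a real endpoint.
    val spark = SparkSession.builder()
      .appName("multi-hive-config")
      .config("hive.metastore.uris", "thrift://metastore-host:9083")
      .enableHiveSupport()
      .getOrCreate()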

Re: Is it better to use Java or Python or Scala for Spark for big data sets

2017-02-09 Thread Irving Duran
I would say Java, since it will be somewhat similar to Scala. Now, this assumes that you have some app already written in Scala. If you don't, then pick the language that you feel most comfortable with. Thank you, Irving Duran On Feb 9, 2017, at 11:59 PM, nancy henry

From C* to DataFrames with JSON

2017-02-09 Thread Jean-Francois Gosselin
Hi all, I'm struggling (Spark / Scala newbie) to create a DataFrame from a C* table, and then to create a DataFrame from a column containing JSON. e.g. from the C* table:

    | id | jsonData                  |
    |----|---------------------------|
    | 1  | {"a": "123", "b": "xyz" } |
    | 2  |
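Assuming Spark 2.1+ with the spark-cassandra-connector on the classpath and an active SparkSession, one hedged sketch is to load the table and parse the JSON column with from_json (keyspace and table names below are placeholders):

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    // Schema matching the jsonData column shown above
    val jsonSchema = new StructType()
      .add("a", StringType)
      .add("b", StringType)

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "myks", "table" -> "mytable")) // placeholders
      .load()

    val parsed = df
      .withColumn("j", from_json(col("jsonData"), jsonSchema))
      .select(col("id"), col("j.a").as("a"), col("j.b").as("b"))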

Is it better to use Java or Python or Scala for Spark for big data sets

2017-02-09 Thread nancy henry
Hi All, Is it better to use Java, Python, or Scala for Spark coding? Mainly, my work is with file data in CSV format; I have to do some rule checking and rule aggregation, then put the final filtered data back into Oracle so that real-time apps can use it.
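A rough sketch of that pipeline in Scala; the column names, rule, path, and connection details are all hypothetical:

    import java.util.Properties
    import org.apache.spark.sql.functions.col

    val props = new Properties()
    props.setProperty("user", "dbuser")     // placeholder credentials
    props.setProperty("password", "secret")

    val raw = spark.read.option("header", "true").csv("/data/input.csv")

    // Hypothetical rule check: keep passing rows, then aggregate per rule
    val filtered = raw.filter(col("amount") > 0)
    val summary  = filtered.groupBy(col("ruleId")).count()

    summary.write.mode("append")
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/service", "FILTERED_DATA", props)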

Re: Driver hung and ran out of memory while writing to console progress bar

2017-02-09 Thread John Fang
The Spark version is 2.1.0.

------------------------------------------------------------------
From: 方孝健(玄弟)
Sent: Friday, February 10, 2017, 12:35
To: spark-dev; spark-user
Subject: Driver hung and ran out of memory while writing to

Driver hung and ran out of memory while writing to console progress bar

2017-02-09 Thread John Fang
[Stage 172:==> (10328 + 93) / 16144]
[Stage 172:==> (10329 + 93) / 16144]
[Stage 172:==> (10330 + 93) / 16144]
[Stage 172:==>

Question about best Spark tuning

2017-02-09 Thread Ji Yan
Dear Spark users, from https://spark.apache.org/docs/latest/tuning.html, which offers this recommendation on setting the level of parallelism:

> Clusters will not be fully utilized unless you set the level of parallelism
> for each operation high enough. Spark automatically sets the number
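For reference, a sketch of the knobs that recommendation points at; the partition counts here are illustrative values, and pairs stands for a hypothetical RDD[(K, V)]:

    import org.apache.spark.SparkConf

    // Cluster-wide default for RDD shuffle operations (illustrative value)
    val conf = new SparkConf().set("spark.default.parallelism", "200")

    // Or per operation: most shuffles accept an explicit partition count
    val counts = pairs.reduceByKey(_ + _, 200)

    // For the DataFrame/SQL side, the analogous runtime setting is:
    spark.conf.set("spark.sql.shuffle.partitions", "200")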

Re: Performance bug in UDAF?

2017-02-09 Thread Spark User
Pinging again on this topic. Is there an easy way to select TopN in a RelationalGroupedDataset? Basically, in the example below, dataSet.groupBy("Column1") returns a RelationalGroupedDataset on which agg(udaf("Column2", "Column3")) runs. One way to address the data skew would be to reduce the data per key
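Not from the thread itself, but a common alternative for per-group TopN is a window function instead of a UDAF; a hedged sketch using the thread's column names and an arbitrary N of 10:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Rank rows within each Column1 group by Column2, keep the top 10
    val w = Window.partitionBy(col("Column1")).orderBy(col("Column2").desc)
    val topN = dataSet
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") <= 10)
      .drop("rn")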

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-09 Thread Jacek Laskowski
Hi, Yes, that's ForeachWriter. Yes, it works element by element. You're looking for mapPartitions, and ForeachWriter has a partitionId that you could use to implement a similar thing. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0
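A minimal sketch of the ForeachWriter pattern being described, assuming a streaming Dataset[String] named ds and leaving the actual sink as a placeholder:

    import org.apache.spark.sql.ForeachWriter

    val writer = new ForeachWriter[String] {
      override def open(partitionId: Long, version: Long): Boolean = {
        // Set up one connection per partition here, keyed by partitionId
        true // return false to skip processing this partition
      }
      override def process(value: String): Unit = {
        // Called once per element; write it to the sink
      }
      override def close(errorOrNull: Throwable): Unit = {
        // Flush and release this partition's connection
      }
    }

    ds.writeStream.foreach(writer).start()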

Re: Dynamic resource allocation to Spark on Mesos

2017-02-09 Thread Michael Gummelt
> by specifying a larger heap size than default on each worker node. I don't follow. Which heap? Are you specifying a large heap size on the executors? If so, do you mean you somehow launch the shuffle service when you launch executors? Or something else? On Wed, Feb 8, 2017 at 5:50 PM, Sun

Re: Counting things in Spark Structured Streaming

2017-02-09 Thread Tathagata Das
Probably something like this:

    dataset
      .filter { userData =>
        // look up the threshold date based on the record details
        val dateThreshold = lookupThreshold(userData)
        userData.date > dateThreshold // compare
      }
      .groupBy()
      .count()

This would

Updating variable in foreachRDD

2017-02-09 Thread Mendelson, Assaf
Hi, I was wondering how foreachRDD would run. Specifically, let's say I do something like this (nothing real, just for understanding):

    var df = ???
    var counter = 0
    dstream.foreachRDD { rdd: RDD[Long] =>
      val df2 = rdd.toDF(...)
      df = df.union(df2)
      counter += 1
      if
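For what it's worth, a hedged sketch of the usual semantics (not taken from this thread's replies): the foreachRDD closure itself executes on the driver once per micro-batch, while actions on the RDD run on the executors, so driver-local variables like counter can be updated safely there:

    var counter = 0L
    dstream.foreachRDD { rdd =>
      // This block runs on the driver, once per micro-batch,
      // so mutating driver-local state here is safe.
      counter += rdd.count() // count() itself is distributed across executors
    }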

java.lang.NoClassDefFoundError: org/apache/spark/streaming/api/java/JavaStreamingContext

2017-02-09 Thread sathyanarayanan mudhaliyar
Error in the highlighted line. Code, error, and pom.xml are included below. Code:

    final Session session = connector.openSession();
    final PreparedStatement prepared =
        session.prepare("INSERT INTO spark_test5.messages JSON?");
    JavaStreamingContext ssc = new

Re: Dataset count on database or parquet

2017-02-09 Thread Suresh Thalamati
If you have to get the data into Parquet format for other reasons anyway, then I think count() on the Parquet should be better. If it is just the count you need, using the database by sending dbTable = (select count(*) from ) might be quicker; it will avoid unnecessary data transfer from the database to
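A hedged sketch of pushing the count down through the JDBC source; the URL, credentials, and table name are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser") // placeholder credentials

    // Wrapping the aggregate in a subquery makes the database do the
    // counting; only a single row travels back to Spark.
    val countDF = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/service",    // placeholder URL
      "(select count(*) as cnt from some_table) t", // placeholder query
      props)

    countDF.show()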