RE: top-k function for Window

2017-01-03 Thread Mendelson, Assaf
Assume you have a UDAF which looks like this: - Input: The value - Buffer: K elements - Output: An array (which would have the K elements) - Init: Initialize all elements to some irrelevant value (e.g. int.MinValue) - Update: Start going over the
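
A hedged sketch of the kind of UDAF outlined above, assuming an Int input column and a fixed K; the class name, the element type and the K handling are illustrative choices, not code from the thread:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Keeps the K largest Int values seen per group.
class TopK(k: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", IntegerType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("topk", ArrayType(IntegerType)) :: Nil)
  override def dataType: DataType = ArrayType(IntegerType)
  override def deterministic: Boolean = true

  // Init: fill the buffer with K "irrelevant" sentinel values.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq.fill(k)(Int.MinValue)
  }

  // Update: if the new value beats the current minimum, replace the minimum.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val current = buffer.getSeq[Int](0)
      val v = input.getInt(0)
      if (v > current.min) buffer(0) = current.sorted.drop(1) :+ v
    }
  }

  // Merge: combine two partial buffers and keep the K largest elements.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = (buffer1.getSeq[Int](0) ++ buffer2.getSeq[Int](0)).sorted.takeRight(k)
  }

  // Output: the K elements, largest first, sentinels filtered out.
  override def evaluate(buffer: Row): Any =
    buffer.getSeq[Int](0).filter(_ != Int.MinValue).sorted.reverse
}

Used as, say, df.groupBy(col("key")).agg(new TopK(5)(col("value"))), this never sorts a whole group, which is the point being made.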

Migrate spark sql to rdd for better performance

2017-01-03 Thread geoHeil
I optimized a spark sql script but have come to the conclusion that the sql api is not ideal as the tasks which are generated are slow and require too much shuffling. So the script should be converted to rdd http://stackoverflow.com/q/41445571/2587904 How can I formulate this more efficient

Re: OS killing Executor due to high (possibly off heap) memory usage

2017-01-03 Thread Koert Kuipers
it would be great if this offheap memory usage becomes more predictable again. currently i see users put memoryOverhead to many gigabytes, sometimes as much as executor memory. it is trial and error to find out what the right number is. so people don't bother and put in huge numbers instead. On
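
For reference, the knob in question is set per application; a hedged example on YARN (the values and your-app.jar are purely illustrative; the overhead is given in megabytes and defaults to a fraction of the executor memory with a 384 MB floor):

spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-app.jar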

Dynamic scheduling not respecting spark.executor.cores

2017-01-03 Thread Nirav Patel
When enabling dynamic scheduling I see that all executors are using only 1 core even if I set "spark.executor.cores" to 6. If dynamic scheduling is disabled then each executor will have 6 cores. I have tested this against Spark 1.5. I wonder if this is the same behavior with 2.x as well.
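
For reference, a dynamic-allocation submission typically carries the settings below (values and your-app.jar are illustrative; the external shuffle service is required for dynamic allocation), and spark.executor.cores is still meant to fix the cores per executor:

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.cores=6 \
  your-app.jar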

Spark test error

2017-01-03 Thread Yanwei Wayne Zhang
I tried to run the tests in 'GeneralizedLinearRegressionSuite', and all tests passed except for test("read/write") which yielded the following error message. Any suggestion on why this happened and how to fix it? Thanks. BTW, I ran the test in IntelliJ. The default jsonEncode only supports

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, This "inline view" idea is really awesome and enlightens me! Finally I have a plan to move on. I greatly appreciate your help! Best regards, Yang 2017-01-03 18:14 GMT+01:00 ayan guha : > Ahh I see what you meanI confused two terminologiesbecause we were >

Re: Apache Hive with Spark Configuration

2017-01-03 Thread Ryan Blue
Chetan, Spark is currently using Hive 1.2.1 to interact with the Metastore. Using that version for Hive is going to be the most reliable, but the metastore API doesn't change very often and we've found (from having different versions as well) that older versions are mostly compatible. Some things

Re: top-k function for Window

2017-01-03 Thread Koert Kuipers
i don't know anything about windowing or about not using developer apis... but a trivial implementation of top-k requires a total sort per group. this can be done with dataset. we do this using spark-sorted ( https://github.com/tresata/spark-sorted) but it's not hard to do it yourself for

Re: [Spark Kafka] How to update batch size of input dynamically for spark kafka consumer?

2017-01-03 Thread Cody Koeninger
You can't change the batch time, but you can limit the number of items in the batch http://spark.apache.org/docs/latest/configuration.html spark.streaming.backpressure.enabled spark.streaming.kafka.maxRatePerPartition On Tue, Jan 3, 2017 at 4:00 AM, 周家帅 wrote: > Hi, > > I am
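
The two settings Cody points at, shown as a hedged SparkConf sketch (the 1000 is an illustrative value); note that neither changes the batch interval itself, they only bound how many records each batch can pull in:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")       // let Spark adapt the ingestion rate to the processing rate
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // cap records per second per Kafka partition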

Re: top-k function for Window

2017-01-03 Thread Andy Dang
Hi Austin, It's trivial to implement top-k in the RDD world - however I would like to stay in the Dataset API world instead of flip-flopping between the two APIs (consistency, wholestage codegen etc). The twitter library appears to support only RDD, and the solution you gave me is very similar

Re: top-k function for Window

2017-01-03 Thread HENSLEE, AUSTIN L
Andy, You might want to also check out the Algebird libraries from Twitter. They have topK and a lot of other helpful functions. I’ve used the Algebird topK successfully on very large data sets. You can also use Spark SQL to do a “poor man’s” topK. This depends on how scrupulous you are about
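
One way to read the “poor man’s” topK suggestion in plain Spark SQL is a row_number over a window; a hedged sketch, where the table and column names (events, key, value) are made up and events is assumed to be registered as a table or view (on Spark 1.x window functions need a HiveContext):

val top3 = spark.sql(
  """SELECT key, value
    |FROM (
    |  SELECT key, value,
    |         row_number() OVER (PARTITION BY key ORDER BY value DESC) AS rn
    |  FROM events
    |) ranked
    |WHERE rn <= 3""".stripMargin)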

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Ahh I see what you mean... I confused two terminologies... because we were talking about partitioning and then changed topic to identifying changed data. For that, you can "construct" a dbtable as an inline view - viewSQL = "(select * from table where >
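
A hedged sketch of that inline view handed to the JDBC reader (connection details, table and column names are placeholders): the dbtable option accepts a parenthesised, aliased subquery in place of a table name, so the incremental predicate is pushed down to the database.

// Push the incremental filter into the database via a subquery-as-table.
val lastSeen = "2017-01-01 00:00:00"  // max of the tracking column captured after the previous run
val viewSQL = s"(SELECT * FROM orders WHERE inserted_on > '$lastSeen') AS incr"

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", viewSQL)   // subquery used in place of a table name
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()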

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Yeah, I understand your proposal, but according to here http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases, it says Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Hi You need to store and capture the Max of the column you intend to use for identifying new records (Ex: INSERTED_ON) after every successful run of your job. Then, use the value in lowerBound option. Essentially, you want to create a query like select * from table where INSERTED_ON >

Mysql table update in spark

2017-01-03 Thread Santlal J Gupta
Hi, I am new to spark and scala development. I want to update Mysql table using spark for my poc. Scenario : Mysql table myCity: namecity 1 2 I want to update this table with below values :

RBackendHandler Error while running ML algorithms with SparkR on RStudio

2017-01-03 Thread Md. Rezaul Karim
Dear Spark Users, I was trying to execute RandomForest and NaiveBayes algorithms on RStudio but experiencing the following error: 17/01/03 15:04:11 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed java.lang.reflect.InvocationTargetException at

Storage history in web UI

2017-01-03 Thread Joseph Naegele
Hi all, Is there any way to observe Storage history in Spark, i.e. which RDDs were cached and where, etc. after an application completes? It appears the Storage tab in the History Server UI is useless. Thanks --- Joe Naegele Grier Forensics

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Thanks a lot for your suggestion. I am currently looking into sqoop. Concerning your suggestion for Spark, it is indeed parallelized with multiple workers, but the job is one-off and cannot keep streaming. Moreover, I cannot specify any "start row" in the job, it will always ingest the

Re: can UDF accept "Any"/"AnyVal"/"AnyRef"(java Object) as parameter or as return type ?

2017-01-03 Thread Koert Kuipers
spark sql is "runtime strongly typed" meaning it must know the actual type. so this will not work On Jan 3, 2017 07:46, "Linyuxin" wrote: > Hi all > > *With Spark 1.5.1* > > > > *When I want to implement a oracle decode function (like > decode(col1,1,’xxx’,’p2’,’yyy’,0))* >

Re: Broadcast Join and Inner Join giving different result on same DataFrame

2017-01-03 Thread ayan guha
I think productBroadcastDF is a broadcast variable in your case, not the DF itself. Try the join with productBroadcastDF.value On Wed, Jan 4, 2017 at 1:04 AM, Patrick wrote: > Hi, > > An Update on above question: In Local[*] mode code is working fine. The > Broadcast size
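
For reference, the two notions of "broadcast" differ: sc.broadcast gives a Broadcast[T] handle whose payload is reached via .value, while a DataFrame broadcast join is requested with the broadcast() hint. A hedged sketch with made-up frame and column names:

import org.apache.spark.sql.functions.broadcast

// DataFrame-level broadcast join: hint the small side, then join as usual.
val joined = ordersDF.join(broadcast(productDF), Seq("product_id"))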

RE: top-k function for Window

2017-01-03 Thread Mendelson, Assaf
You can write a UDAF in which the buffer contains the top K and manage it. This means you don’t need to sort at all. Furthermore, in your example you don’t even need a window function, you can simply use groupby and explode. Of course, this is only relevant if k is small… From: Andy Dang
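
The groupBy/explode variant can be sketched roughly as below (illustrative names again, and only sensible when k and the groups are small, since collect_list pulls every value of a group into one row; the UDAF above avoids that):

import org.apache.spark.sql.functions.{col, collect_list, explode, udf}

// Collect each group's values, keep only the k largest, then explode back to rows.
val k = 3
val topKUdf = udf { (vs: Seq[Int]) => vs.sorted.reverse.take(k) }

val topKRows = input
  .groupBy("key")
  .agg(collect_list("value").as("values"))
  .withColumn("value", explode(topKUdf(col("values"))))
  .drop("values")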

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Hi While the solutions provided by others look promising and I'd like to try out a few of them, our old pal sqoop already "does" the job. It has an incremental mode where you can provide a --check-column and --last-modified-value combination to grab the data - and yes, sqoop essentially does it by
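
For reference, an incremental sqoop import along those lines looks roughly like this (connection string, table, column and values are placeholders; the actual flags are --check-column, --incremental and --last-value, and sqoop reports the value to pass as --last-value on the next run):

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2017-01-01 00:00:00" \
  --target-dir /data/orders_incr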

Re: Broadcast Join and Inner Join giving different result on same DataFrame

2017-01-03 Thread Patrick
Hi, An update on the above question: in Local[*] mode the code is working fine. The broadcast size is 200MB, but on YARN the broadcast join is giving an empty result. But in the SQL query shown in the UI, it does show BroadcastHint. Thanks On Fri, Dec 30, 2016 at 9:15 PM, titli batali wrote:

top-k function for Window

2017-01-03 Thread Andy Dang
Hi all, What's the best way to do top-k with Windowing in Dataset world? I have a snippet of code that filters the data to the top-k, but with skewed keys: val windowSpec = Window.partitionBy(skewedKeys).orderBy(dateTime) val rank = row_number().over(windowSpec) input.withColumn("rank",
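
For context, the window approach spelled out end to end (a hedged sketch: "key" stands in for the partitioning columns and k is illustrative). The skew concern is that every row of a hot key lands in one task for the sort, which is what the UDAF and groupBy/explode suggestions in this thread try to avoid.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val k = 10
val windowSpec = Window.partitionBy(col("key")).orderBy(col("dateTime"))

val topK = input
  .withColumn("rank", row_number().over(windowSpec))
  .filter(col("rank") <= k)
  .drop("rank")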

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Tamas, Thanks a lot for your suggestion! I will also investigate this one later. Best regards, Yang 2017-01-03 12:38 GMT+01:00 Tamas Szuromi : > > You can also try https://github.com/zendesk/maxwell > > Tamas > > On 3 January 2017 at 12:25, Amrit Jangid

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Amrit, Thanks a lot for your suggestion! I will investigate it later. Best regards, Yang 2017-01-03 12:25 GMT+01:00 Amrit Jangid : > You can try out *debezium* : https://github.com/debezium. it reads data > from bin-logs, provides structure and stream into Kafka. >

Re: How to load a big csv to dataframe in Spark 1.6

2017-01-03 Thread Steve Loughran
On 31 Dec 2016, at 16:09, Raymond Xie > wrote: Hello Felix, I followed the instruction and ran the command: > $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 and I received the following error message:
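
For reference, once the package is on the classpath the load itself looks like this on Spark 1.6 (a sketch; the path and options are placeholders). Note also that the artifact's Scala suffix has to match the Spark build - a stock Spark 1.6 download is built for Scala 2.10, so the coordinate would be com.databricks:spark-csv_2.10:1.5.0 there.

// Reading a CSV with the spark-csv package on Spark 1.6.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line is a header
  .option("inferSchema", "true")  // scan once to infer column types
  .load("hdfs:///data/big.csv")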

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Thanks a lot for such a detailed response. I really appreciate it! I think this use case can be generalized, because the data is immutable and append-only. We only need to find one column or timestamp to track the last row consumed in the previous ingestion. This pattern should be

Re: Question about Spark and filesystems

2017-01-03 Thread Steve Loughran
On 18 Dec 2016, at 19:50, joa...@verona.se wrote: Since each Spark worker node needs to access the same files, we have tried using Hdfs. This worked, but there were some oddities making me a bit uneasy. For dependency hell reasons I compiled a modified Spark, and this

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Tamas Szuromi
You can also try https://github.com/zendesk/maxwell Tamas On 3 January 2017 at 12:25, Amrit Jangid wrote: > You can try out *debezium* : https://github.com/debezium. it reads data > from bin-logs, provides structure and stream into Kafka. > > Now Kafka can be your new

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Amrit Jangid
You can try out *debezium* : https://github.com/debezium. it reads data from bin-logs, provides structure and stream into Kafka. Now Kafka can be your new source for streaming. On Tue, Jan 3, 2017 at 4:36 PM, Yuanzhe Yang wrote: > Hi Hongdi, > > Thanks a lot for your

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Hongdi, Thanks a lot for your suggestion. The data is truely immutable and the table is append-only. But actually there are different databases involved, so the only feature they share in common and I can depend on is jdbc... Best regards, Yang 2016-12-30 6:45 GMT+01:00 任弘迪

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Michael, Thanks a lot for your ticket. At least it is the first step. Best regards, Yang 2016-12-30 2:01 GMT+01:00 Michael Armbrust : > We don't support this yet, but I've opened this JIRA as it sounds > generally useful:

[Spark Kafka] How to update batch size of input dynamically for spark kafka consumer?

2017-01-03 Thread 周家帅
Hi, I am an intermediate spark user and have some experience in large data processing. I posted this question on StackOverflow but received no response. My problem is as follows: I use createDirectStream in my spark streaming application. I set the batch interval to 7 seconds and most of the time

Re: Re: Re: Spark Streaming prediction

2017-01-03 Thread Marco Mistroni
Hi, ok then my suggestion stays. Check out ML: you can train your ML model on past data (let's say, either yesterday or the past x days) to have Spark find out what the relation is between the value you have at T-zero and the value you have at T+n hours, and you can try ML outside your streaming app by