Re: Reading parquet files into Spark Streaming

2016-08-26 Thread Akhilesh Pathodia
Hi Renato, Which version of Spark are you using? If the Spark version is 1.3.0 or later, you can use SQLContext to read the parquet file, which will give you a DataFrame. Please follow the link below: https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically
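A minimal Scala sketch of the SQLContext approach, assuming Spark 1.5.x and a placeholder parquet path; the resulting DataFrame can then be fed into whatever streaming logic is needed.
-- Code --
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("ParquetRead")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the parquet data into a DataFrame (the path is a placeholder)
val df = sqlContext.read.parquet("hdfs:///data/events/")
df.printSchema()
df.show(5)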

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
Pyspark example based on the data you provided (obviously your dataframes will come from whatever source you have, not entered directly). This uses an intermediary dataframe with grouped data for clarity, but you could pull this off in other ways. -- Code -- from pyspark.sql.types import * from

Re: Dynamically change executors settings

2016-08-26 Thread linguin . m . s
Hi, No, currently you can't change the setting. // maropu On 2016/08/27 at 11:40, Vadim Semenov wrote: > Hi spark users, > > I wonder if it's possible to change executors settings on-the-fly. > I have the following use-case: I have a lot of non-splittable skewed

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Mike Metzger
I would also suggest building the container manually first and setting up everything you specifically need. Once done, you can grab the history file, pull out the invalid commands, and build out the completed Dockerfile. Trying to troubleshoot an installation via Dockerfile is often an exercise

Dynamically change executors settings

2016-08-26 Thread Vadim Semenov
Hi spark users, I wonder if it's possible to change executor settings on-the-fly. I have the following use-case: I have a lot of non-splittable skewed files in a custom format that I read using a custom Hadoop RecordReader. These files can range from small to huge, and I'd like to use only one or two cores

Re: mutable.LinkedHashMap kryo serialization issues

2016-08-26 Thread Rahul Palamuttam
Hi, I apologize, I spoke too soon. Those transient member variables may not be the issue. To clarify my test case I am creating a LinkedHashMap with two elements in a map expression on an RDD. Note that the LinkedHashMaps are being created on the worker JVMs (not the driver JVM) and THEN

Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it: https://github.com/graphframes/graphframes/releases/tag/release-0.2.0 Thanks for the reminder! Joseph On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński wrote: > Hi, > Do you plan to add tag for this release on github ? >

Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-26 Thread Mich Talebzadeh
Thanks Jacek, I will have a look. I think it is long overdue. I mean, we try to micro-batch and stream everything at sub-second latencies, but when it comes to helping monitor the basics we are still miles behind :( Cheers, Dr Mich Talebzadeh LinkedIn *

Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-26 Thread Jacek Laskowski
Hi Mich, I don't think so. There is support for a UI page refresh but I haven't seen it in use. See StreamingPage [1], where it schedules a refresh every 5 secs, i.e. Some(5000). In SparkUIUtils.headerSparkPage [2] there is refreshInterval, but it's not used anywhere in Spark. Time to file an

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Michael Gummelt
Run with "-X -e" like the error message says. See what comes out. On Fri, Aug 26, 2016 at 2:23 PM, Tal Grynbaum wrote: > Did you specify -Dscala-2.10 > As in > ./dev/change-scala-version.sh 2.10 ./build/mvn -Pyarn -Phadoop-2.4 > -Dscala-2.10 -DskipTests clean package >

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
What's happening to my English? Too many typos, sorry. Let me rephrase: HTTP2 for fully pipelined, out-of-order execution. In other words, I should be able to send multiple requests through the same TCP connection, and by out-of-order execution I mean that if I send Req1 at t1 and Req2 at t2, where t1 < t2, and

Re: Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Anybody? I think Rory also didn't get an answer from the list ... https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com>: >

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
HTTP2 for fully pipelined, out-of-order execution. In other words, I should be able to send multiple requests through the same TCP connection, and by out-of-order execution I mean that if I send Req1 at t1 and Req2 at t2, where t1 < t2, and Req2 finishes before Req1, I should be able to get a response from

RE: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Subhajit Purkayastha
So the data in the fcst dataframe is like this:
Product  fcst_qty
A        100
B        50
The Sales DF has data like this:
Order#  Item#  Sales qty
101     A      10
101     B      5
102     A      5
102     B      10
I want

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Tal Grynbaum
Did you specify -Dscala-2.10? As in: ./dev/change-scala-version.sh 2.10 and then ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package, if you're building with Scala 2.10. On Sat, Aug 27, 2016, 00:18 Marco Mistroni wrote: > Hello Michael > uhm i celebrated too soon

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread Jacek Laskowski
Hi, Never heard of one myself. I don't think Bahir [1] offers it, either. Perhaps socketTextStream or textFileStream with an http URI could be of some help? What would you expect from such an HTTP/2 receiver? What are the requirements? Why http/2? #curious [1] http://bahir.apache.org/ Pozdrawiam,

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Marco Mistroni
Hello Michael, uhm, I celebrated too soon. Compilation of Spark on the Docker image got near the end and then errored out with this message: [INFO] BUILD FAILURE [INFO] [INFO] Total time: 01:01 h [INFO] Finished at:

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
Without seeing exactly what you want to accomplish, it's hard to say. A join is still probably the method I'd suggest, using something like: select (FCST.quantity - SO.quantity) as quantity from FCST LEFT OUTER JOIN SO ON FCST.productid = SO.productid WHERE with specifics depending on
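A hedged Scala sketch of that join, assuming Spark 2.0, that sales are summed per product first, and that products with no sales keep their full forecast; the table and column names are illustrative, not from the original thread.
-- Code --
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FcstAdjust").getOrCreate()
import spark.implicits._

// Illustrative data matching the thread's example
val fcst = Seq(("A", 100), ("B", 50)).toDF("productid", "fcst_qty")
val sales = Seq((101, "A", 10), (101, "B", 5), (102, "A", 5), (102, "B", 10))
  .toDF("orderid", "productid", "sales_qty")

fcst.createOrReplaceTempView("FCST")
sales.createOrReplaceTempView("SO")

// Subtract total sales per product from the forecast; COALESCE keeps the
// forecast intact for products with no sales rows
val adjusted = spark.sql("""
  SELECT f.productid,
         f.fcst_qty - COALESCE(s.total_sales, 0) AS fcst_qty
  FROM FCST f
  LEFT OUTER JOIN (SELECT productid, SUM(sales_qty) AS total_sales
                   FROM SO GROUP BY productid) s
  ON f.productid = s.productid
""")
adjusted.show()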

Re: Spark 1.6 Streaming with Checkpointing

2016-08-26 Thread Jacek Laskowski
On Fri, Aug 26, 2016 at 10:54 PM, Benjamin Kim wrote: > // Create a text file stream on an S3 bucket > val csv = ssc.textFileStream("s3a://" + awsS3BucketName + "/") > > csv.foreachRDD(rdd => { > if (!rdd.partitions.isEmpty) { >

Spark 1.6 Streaming with Checkpointing

2016-08-26 Thread Benjamin Kim
I am trying to implement checkpointing in my streaming application but I am getting a not serializable error. Has anyone encountered this? I am deploying this job in YARN clustered mode. Here is a snippet of the main parts of the code. object S3EventIngestion { //create and setup streaming
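A minimal sketch of a checkpoint-friendly layout, assuming Spark 1.6 streaming and placeholder S3 locations; one common cure for the serialization error is to build the entire DStream graph inside the factory function passed to StreamingContext.getOrCreate so nothing from the enclosing object gets captured.
-- Code --
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3EventIngestion {
  // Hypothetical locations
  val checkpointDir = "s3a://my-bucket/checkpoints"
  val inputDir = "s3a://my-bucket/incoming/"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("S3EventIngestion")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)

    // Define the whole DStream graph inside the factory so it can be
    // rebuilt from the checkpoint on restart
    val csv = ssc.textFileStream(inputDir)
    csv.foreachRDD { rdd =>
      if (!rdd.partitions.isEmpty) rdd.saveAsTextFile("s3a://my-bucket/output")
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}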

RE: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Subhajit Purkayastha
Mike, The grains of the DataFrames are different. I need to reduce the forecast qty (which is in the FCST DF) based on the sales qty (coming from the sales order DF). Hope it helps, Subhajit From: Mike Metzger [mailto:m...@flexiblecreations.com] Sent: Friday, August 26, 2016

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
Without seeing the makeup of the DataFrames or your logic for updating them, I'd suggest doing a join of the Forecast DF with the appropriate columns from the SalesOrder DF. Mike On Fri, Aug 26, 2016 at 11:53 AM, Subhajit Purkayastha wrote: > I am using spark 2.0,

is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
is there a HTTP2 (v2) endpoint for Spark Streaming?

Re: unable to start slaves from master (SSH problem)

2016-08-26 Thread kant kodali
Fixed. I just had to log out and log back in to the master node, for some reason. On Fri, Aug 26, 2016 5:32 AM, kant kodali kanth...@gmail.com wrote: Hi, I am unable to start spark slaves from my master node. When I run ./start-all.sh on my master node it brings up the master but fails for the slaves

Re: spark 2.0 home brew package missing

2016-08-26 Thread RAJESHWAR MANN
Thank you! That was it. 2.0 installed fine after the update. Regards > On Aug 26, 2016, at 1:37 PM, Noorul Islam K M wrote: > > kalkimann writes: > >> Hi, >> spark 1.6.2 is the latest brew package i can find. >> spark 2.0.x brew package is missing,

Re: EMR for spark job - instance type suggestion

2016-08-26 Thread Gavin Yue
I tried both M4 and R3. R3 is slightly more expensive, but has more memory. If you're doing a lot of in-memory work, like joins, I recommend R3. Otherwise M4 is fine. Also, I remember M4 is an EBS-only instance, so you have to pay the additional EBS cost as well. On Fri, Aug 26, 2016 at 10:29 AM,

Re: spark 2.0 home brew package missing

2016-08-26 Thread Noorul Islam K M
kalkimann writes: > Hi, > spark 1.6.2 is the latest brew package i can find. > spark 2.0.x brew package is missing, best i know. > > Is there a schedule when spark-2.0 will be available for "brew install"? > Did you do a 'brew update' before searching? I installed

EMR for spark job - instance type suggestion

2016-08-26 Thread Saurabh Malviya (samalviy)
We are going to use an EMR cluster for Spark jobs in AWS. Any suggestion for the instance type to use: M3.xlarge or r3.xlarge? Details: 1) We are going to run a couple of streaming jobs, so we need on-demand instances. 2) There is no data on HDFS/S3; all data is pulled from Kafka or

spark 2.0 home brew package missing

2016-08-26 Thread kalkimann
Hi, spark 1.6.2 is the latest brew package I can find. The spark 2.0.x brew package is missing, as best I know. Is there a schedule for when spark-2.0 will be available for "brew install"? Thanks -- View this message in context:

Fwd: Populating tables using hive and spark

2016-08-26 Thread Timur Shenkao
Hello! I just wonder: do you (both of you) use the same user for HIVE & Spark? Or different? Do you use Kerberized Hadoop? On Mon, Aug 22, 2016 at 2:20 PM, Mich Talebzadeh wrote: > Ok This is my test > > 1) create table in Hive and populate it with two rows > >

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Michael Gummelt
:) On Thu, Aug 25, 2016 at 2:29 PM, Marco Mistroni wrote: > No i wont accept that :) > I can't believe i have wasted 3 hrs for a space! > > Many thanks MIchael! > > kr > > On Thu, Aug 25, 2016 at 10:01 PM, Michael Gummelt > wrote: > >> You have a

Re: mutable.LinkedHashMap kryo serialization issues

2016-08-26 Thread Rahul Palamuttam
Thanks Renato. I forgot to reply-all last time. I apologize for the rather confusing example. All the snippet code did was: 1. Make an RDD of LinkedHashMaps with size 2 2. On the worker side, get the sizes of the HashMaps (via a map(hash => hash.size)) 3. On the driver, call collect on the

Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Subhajit Purkayastha
I am using spark 2.0 and have 2 DataFrames, SalesOrder and Forecast. I need to update the Forecast DataFrame record(s) based on the SalesOrder DF record. What is the best way to achieve this functionality?

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran
On 26 Aug 2016, at 12:58, kant kodali wrote: @Steve your arguments make sense; however, a good majority of people who have extensive experience with zookeeper prefer to avoid it, and given the ease of consul (which btw uses raft for

Spark driver memory breakdown

2016-08-26 Thread Mich Talebzadeh
Hi, I always underestimated the significance of setting spark.driver.memory. According to the documentation, it is the amount of memory to use for the driver process, i.e. where the SparkContext is initialized (e.g. 1g, 2g). I was running my application using Spark Standalone, so the argument about Local

Re: zookeeper mesos logging in spark

2016-08-26 Thread Michael Gummelt
These are the libmesos logs. Maybe look here http://mesos.apache.org/documentation/latest/logging/ On Fri, Aug 26, 2016 at 8:31 AM, aecc wrote: > Hi, > > Everytime I run my spark application using mesos, I get logs in my console > in the form: > > 2016-08-26

Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi all, I am trying to use parquet files as input for DStream operations, but I can't find any documentation or example. The only thing I found was [1], but I also get the same error as in the post (Class parquet.avro.AvroReadSupport not found). Ideally I would like to have something like this:

Re: Insert non-null values from dataframe

2016-08-26 Thread Russell Spitzer
Cassandra does not differentiate between null and empty, so when reading from C* all empty values are reported as null. To avoid inserting nulls (avoiding tombstones) see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#globally-treating-all-nulls-as-unset This
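A hedged Scala sketch of the globally-ignore-nulls setting described in the linked doc, assuming the DataFrame write path of the spark-cassandra-connector; the keyspace, table, and source path are placeholders, and the exact option name should be checked against your connector version.
-- Code --
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CassandraNullsAsUnset")
  // Globally treat nulls as unset on write, per the linked connector doc
  .config("spark.cassandra.output.ignoreNulls", "true")
  .getOrCreate()

// df is assumed to contain some null columns we do not want written as tombstones
val df = spark.read.json("s3a://my-bucket/events.json") // placeholder source

df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events")) // placeholder names
  .mode("append")
  .save()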

Re: How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Xinh Huynh
That looks like a pivot table. Have you looked into using the pivot table method with DataFrames? Xinh > On Aug 26, 2016, at 4:54 AM, Rex X wrote: > > 1. Given following CSV file > $cat data.csv > > ID,City,Zip,Price,Rating > 1,A,95123,100,0 > 1,B,95124,102,1 >
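A small Scala sketch of the pivot approach, assuming Spark 2.0 (where groupBy().pivot() and the built-in CSV reader are available) and using the column names from the thread's data.csv.
-- Code --
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("PivotExample").getOrCreate()

// Read the CSV from the thread (the local path is a placeholder)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")

// One row per ID; a composite column per City with the average Price and max Rating
val pivoted = df.groupBy("ID")
  .pivot("City")
  .agg(avg("Price"), max("Rating"))

pivoted.show()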

zookeeper mesos logging in spark

2016-08-26 Thread aecc
Hi, Everytime I run my spark application using mesos, I get logs in my console in the form: 2016-08-26 15:25:30,949:960521(0x7f6bccff9700):ZOO_INFO@log_env 2016-08-26 15:25:30,949:960521(0x7f6bccff9700):ZOO_INFO@log_env 2016-08-26 15:25:30,949:960521(0x7f6bccff9700):ZOO_INFO@log_env 2016-08-26

Re: Best way to calculate intermediate column statistics

2016-08-26 Thread Mich Talebzadeh
Hi Bedrytski, I assume you are referring to my code above. The alternative SQL would be (the first code, with rank): SELECT * FROM ( SELECT transactiondate, transactiondescription, debitamount, RANK() OVER (ORDER BY transactiondate desc) AS rank FROM WHERE
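For reference, a self-contained Scala sketch of the same RANK() pattern, assuming Spark 2.0; the ledger rows and table name are made up, since the original statement is truncated.
-- Code --
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RankExample").getOrCreate()
import spark.implicits._

// Hypothetical ledger rows: date, description, amount
val ledger = Seq(
  ("2016-08-24", "coffee", 3.50),
  ("2016-08-25", "books", 20.00),
  ("2016-08-26", "groceries", 45.10)
).toDF("transactiondate", "transactiondescription", "debitamount")
ledger.createOrReplaceTempView("ledger")

// Rank transactions by date, newest first, and keep the latest one
val latest = spark.sql("""
  SELECT * FROM (
    SELECT transactiondate, transactiondescription, debitamount,
           RANK() OVER (ORDER BY transactiondate DESC) AS rnk
    FROM ledger
  ) t
  WHERE rnk = 1
""")
latest.show()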

Re: Best way to calculate intermediate column statistics

2016-08-26 Thread Bedrytski Aliaksandr
Hi Mich, I was wondering what are the advantages of using helper methods instead of one SQL multiline string? (I rarely (if ever) use helper methods, but maybe I'm missing something) Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh wrote: >

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
@Mich, of course, and in my previous message I gave context as well. Needless to say, the tools used by many banks I came across, such as Citi, Capital One, Wells Fargo, and GSachs, are pretty laughable when it comes to compliance and security. They somehow think they are secure when

Re: How to do this pairing in Spark?

2016-08-26 Thread ayan guha
Off the top of my head: select * from (select ID, Flag, lead(ID) over (partition by City, Zip order by Flag, ID) c from t) where Flag = 0 and c is not null should do it. Basically you want to keep records which have Flag 0 and have a corresponding 1. Please let me know if it doesn't work, so I can provide a right
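A hedged Scala sketch of that lead()-based pairing, assuming Spark 2.0; the rows for IDs 7-9 are filled in to match the discussion, and the outer filter uses the Flag column as the thread's description implies.
-- Code --
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PairRows").getOrCreate()
import spark.implicits._

// Data from the "How to do this pairing in Spark?" thread (IDs 7-9 assumed)
val t = Seq(
  (1, "A", 95126, 0),
  (2, "A", 95126, 1),
  (3, "A", 95126, 1),
  (7, "B", 95124, 1),
  (8, "B", 95124, 0),
  (9, "B", 95124, 1)
).toDF("ID", "City", "Zip", "Flag")
t.createOrReplaceTempView("t")

// Keep each Flag=0 row only if the next row in its City/Zip group
// (ordered by Flag then ID) exists, i.e. it has a Flag=1 partner
val paired = spark.sql("""
  SELECT ID, nextId AS pairedWith, City, Zip
  FROM (
    SELECT ID, City, Zip, Flag,
           LEAD(ID) OVER (PARTITION BY City, Zip ORDER BY Flag, ID) AS nextId
    FROM t
  ) x
  WHERE Flag = 0 AND nextId IS NOT NULL
""")
paired.show()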

Re: How to install spark with s3 on AWS?

2016-08-26 Thread kant kodali
Hmm, do I always need to have that in my driver program? Why can't I set it somewhere such that the Spark cluster realizes it needs to use S3? On Fri, Aug 26, 2016 5:13 AM, Devi P.V devip2...@gmail.com wrote: The following piece of code works for me to read data from S3 using Spark. val

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Carlile, Ken
We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3. —Ken On Aug 25, 2016, at 8:02 PM,

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Mich Talebzadeh
And yes, any technology needs time to mature, but that said it shouldn't stop us from transitioning. It depends on the application and how mission-critical the business it is deployed for is. If you are using a tool for a Bank's Credit Risk (Surveillance, Anti-Money Laundering, Employee

unable to start slaves from master (SSH problem)

2016-08-26 Thread kant kodali
Hi, I am unable to start the spark slaves from my master node. When I run ./start-all.sh on my master node it brings up the master but fails for the slaves, saying "permission denied (public key)". I did add the master's id_rsa.pub to my slaves' authorized_keys, and I checked manually from my

Re: How to install spark with s3 on AWS?

2016-08-26 Thread Devi P.V
The following piece of code works for me to read data from S3 using Spark. val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]") val sc = new SparkContext(conf) val hadoopConf=sc.hadoopConfiguration; hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native
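A slightly fuller sketch using the s3a connector instead, assuming the hadoop-aws jar and its AWS SDK dependency are on the classpath; the credentials and bucket are placeholders (IAM roles or environment variables are preferable to hard-coding keys).
-- Code --
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("S3 Read Example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Hadoop-side configuration for the s3a filesystem (keys are placeholders)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Read an object from a hypothetical bucket
val lines = sc.textFile("s3a://my-bucket/path/to/data.csv")
println(lines.count())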

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
@Steve your arguments make sense; however, a good majority of people who have extensive experience with zookeeper prefer to avoid it, and given the ease of consul (which btw uses raft for the election) and etcd, a lot of us are more inclined to avoid ZK. And yes, any technology needs

Re: How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Rex X
The data.csv needs to be corrected: 1. Given following CSV file $cat data.csv
ID,City,Zip,Price,Rating
1,A,95123,100,1
1,B,95124,102,2
1,A,95126,100,2
2,B,95123,200,1
2,B,95124,201,2
2,C,95124,203,1
3,A,95126,300,2
3,C,95124,280,1
4,C,95124,400,2
On Fri, Aug 26, 2016 at 4:54 AM, Rex X

How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Rex X
1. Given following CSV file $cat data.csv
ID,City,Zip,Price,Rating
1,A,95123,100,0
1,B,95124,102,1
1,A,95126,100,1
2,B,95123,200,0
2,B,95124,201,1
2,C,95124,203,0
3,A,95126,300,1
3,C,95124,280,0
4,C,95124,400,1
We want to group by ID, and make new composite columns of Price and Rating based on the

How to install spark with s3 on AWS?

2016-08-26 Thread kant kodali
Hi guys, Are there any instructions on how to setup spark with S3 on AWS? Thanks!

Re: How to do this pairing in Spark?

2016-08-26 Thread Rex X
Hi Ayan, Yes, ID=3 can be paired with ID=1, and the same for ID=9 with ID=8. BUT we want to keep only ONE pair for each ID with Flag=0. Since ID=1 with Flag=0 is already paired with ID=2, and ID=8 is paired with ID=7, we simply delete ID=3 and ID=9. Thanks! Regards, Rex On Fri, Aug 26, 2016 at

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran
On 25 Aug 2016, at 22:49, kant kodali wrote: yeah, so it seems like it's a work in progress. At the very least, Mesos took the initiative to provide alternatives to ZK. I am just really looking forward to this.

Re: How to do this pairing in Spark?

2016-08-26 Thread ayan guha
Why should 3 and 9 be deleted? 3 can be paired with 1 and 9 can be paired with 8. On 26 Aug 2016 11:00, "Rex X" wrote: > 1. Given following CSV file > > > $cat data.csv > > > > ID,City,Zip,Flag > > 1,A,95126,0 > > 2,A,95126,1 > > 3,A,95126,1 > >

Re: mutable.LinkedHashMap kryo serialization issues

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi Rahul, You have probably already figured this one out, but anyway... You need to register the classes that you'll be using with Kryo because it does not support all Serializable types and requires you to register the classes you’ll use in the program in advance. So when you don't register the
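A minimal Scala sketch of the registration step described above, assuming Kryo is the configured serializer and using the two-element LinkedHashMap case from this thread.
-- Code --
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

val conf = new SparkConf()
  .setAppName("KryoLinkedHashMap")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register the concrete collection class used on the workers
  .registerKryoClasses(Array(classOf[mutable.LinkedHashMap[String, String]]))

val sc = new SparkContext(conf)

// Build LinkedHashMaps on the executors, then bring their sizes (and the
// maps themselves) back to the driver, as in the original test case
val maps = sc.parallelize(1 to 4).map { i =>
  mutable.LinkedHashMap("a" -> i.toString, "b" -> (i * 2).toString)
}
println(maps.map(_.size).collect().mkString(","))
println(maps.collect().mkString(";"))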