Re: Kafka directsream receiving rate

2016-02-06 Thread Diwakar Dhanuskodi
Thanks, Cody, for trying to understand the issue. Sorry if I am not clear. The scenario is to process all messages at once in a single dstream block when the source system publishes messages. The source system will publish x messages every 10 minutes. By events I meant that total

Fwd: Question on how to access tuple values in spark

2016-02-06 Thread mdkhajaasmath
> Hi, > > My requirement is to find the max value of revenue per customer, so I am using the below > query. I got this solution from a tutorial I found on Google but am not able to > understand how it returns the max in this scenario. Can anyone help? > > revenuePerDayPerCustomerMap.reduceByKey((x, y) => (if(x._2 >=
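
The quoted query is cut off, but the pattern is presumably the standard pairwise-max reduce: reduceByKey keeps, per customer key, whichever tuple wins the comparison. A minimal self-contained Scala sketch, with example data and tuple layout that are assumptions rather than the poster's actual schema:

import org.apache.spark.{SparkConf, SparkContext}

object MaxRevenuePerCustomer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("max-revenue").setMaster("local[*]"))

    // (customerId, (day, revenue)) pairs -- hypothetical sample data
    val revenuePerDayPerCustomerMap = sc.parallelize(Seq(
      ("c1", ("2016-02-01", 100.0)),
      ("c1", ("2016-02-02", 250.0)),
      ("c2", ("2016-02-01", 80.0))))

    // For each key, the lambda compares two candidate tuples and keeps the
    // one with the larger revenue (x._2 vs y._2); the survivor is the max.
    val maxPerCustomer = revenuePerDayPerCustomerMap
      .reduceByKey((x, y) => if (x._2 >= y._2) x else y)

    maxPerCustomer.collect().foreach(println)
    // (c1,(2016-02-02,250.0))
    // (c2,(2016-02-01,80.0))
    sc.stop()
  }
}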

Re: Question on how to access tuple values in spark

2016-02-06 Thread mdkhajaasmath
Sent from my iPhone > On Feb 6, 2016, at 4:41 PM, KhajaAsmath Mohammed > wrote: > > Hi, > > My requirement is to find the max value of revenue per customer, so I am using the below > query. I got this solution from a tutorial I found on Google but am not able to > understand how it

Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Andrew Holway
I'm managing to read data via JDBC using the following, but I can't work out how to write something back to the database. df <- read.df(sqlContext, source="jdbc", url="jdbc:mysql://hostname:3306?user=user&password=pass", dbtable="database.table") Does this functionality exist in 1.5.2? Thanks,
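
For reference, the Scala DataFrame API in 1.5.x does expose a JDBC writer even where SparkR's write.df may not; a hedged sketch (the URL, table names, and credentials are placeholders, and the MySQL JDBC driver must be on the classpath):

import java.util.Properties
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}

object JdbcWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-write").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val props = new Properties()
    props.setProperty("user", "user")
    props.setProperty("password", "pass")

    // Read, mirroring the SparkR read.df call above
    val df = sqlContext.read.jdbc("jdbc:mysql://hostname:3306/database", "database.table", props)

    // Write back; SaveMode controls what happens if the target table exists
    df.write.mode(SaveMode.Append)
      .jdbc("jdbc:mysql://hostname:3306/database", "database.table_copy", props)

    sc.stop()
  }
}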

Re: Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Andrew Holway
> > df <- read.df(sqlContext, source="jdbc", > url="jdbc:mysql://hostname:3306?user=user&password=pass", > dbtable="database.table") > I got a bit further but am now getting the following error. This error is being thrown without the database being touched. I tested this by making the database

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Corey Nolet
The whole purpose of Apache mailing lists is that the messages get indexed all over the web so that discussions and questions/solutions can be searched easily by Google and other engines. For this reason, and because the messages are sent via email as Steve pointed out, it's just not possible to

Spark Streaming with Druid?

2016-02-06 Thread unk1102
Hi, has anybody tried Spark Streaming with Druid as a low-latency store? The combination seems powerful; is it worth trying them together? Please guide and share your experience. I am aiming to build the best low-latency streaming analytics.

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Steve Loughran
> On 5 Feb 2016, at 17:35, Marcelo Vanzin wrote: > > You don't... just send a new one. > > On Fri, Feb 5, 2016 at 9:33 AM, swetha kasireddy > wrote: >> Hi, >> >> I want to edit/delete a message posted in Spark User List. How do I do that? >>

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-06 Thread Udo Fholl
Sorry, I realized that I left out a bit in my last email. This is the only BLOCKED thread in the dump. The Reference Handler is blocked most likely due to the GC running at the moment of the dump. "Reference Handler" daemon prio=10 tid=2 BLOCKED at java.lang.Object.wait(Native Method) at
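
Separate from the thread dump itself: for mapWithState heap growth of this kind, one commonly suggested mitigation is a state timeout so idle keys get evicted. A minimal sketch against the 1.6 API (the source, checkpoint path, and state type are assumptions; the original job reads from Kinesis instead):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("mws").setMaster("local[2]"), Seconds(10))
    ssc.checkpoint("/tmp/mws-checkpoint") // placeholder path

    // Hypothetical source; the thread's job uses a Kinesis stream instead.
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // Running count per key. A timing-out state cannot be updated, hence the guard.
    val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      if (!state.isTimingOut()) state.update(sum)
      (key, sum)
    }

    // The timeout bounds state (and heap) growth by evicting idle keys.
    val stateful = events.mapWithState(
      StateSpec.function(mappingFunc).timeout(Minutes(30)))

    stateful.print()
    ssc.start()
    ssc.awaitTermination()
  }
}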

Re: Kafka directsream receiving rate

2016-02-06 Thread Cody Koeninger
I am not at all clear on what you are saying. "Yes, I am printing each message. It is processing all messages under each dstream block." If it is processing all messages, what is the problem you are having? "The issue is with DirectStream processing 10 messages per event." What

Re: Shuffle memory woes

2016-02-06 Thread Corey Nolet
Igor, thank you for the response, but unfortunately the problem I'm referring to goes beyond this. I have set the shuffle memory fraction to 90% and the cache memory fraction to 0. Repartitioning the RDD helped a tad on the map side but didn't do much for the spilling when there was no longer

Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-06 Thread Li Ming Tsai
Hi, I did more investigation and found out that BLAS.scala is calling the pure-Java reference implementation (f2jBLAS) for level 1 routines. I even patched it to use nativeBlas.ddot, but it had no material impact.
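
A quick way to confirm at runtime which implementation netlib-java actually bound (a hedged sketch; it assumes the com.github.fommil.netlib classes are on the classpath, which they are as a transitive dependency of spark-mllib):

import com.github.fommil.netlib.{BLAS, LAPACK}

object BlasCheck {
  def main(args: Array[String]): Unit = {
    // F2jBLAS means the pure-Java fallback is in use; NativeSystemBLAS or
    // NativeRefBLAS means a native library was successfully loaded.
    println("BLAS:   " + BLAS.getInstance().getClass.getName)
    println("LAPACK: " + LAPACK.getInstance().getClass.getName)
  }
}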

Apache Spark data locality when integrating with Kafka

2016-02-06 Thread fanooos
Dears, if I use Kafka as a streaming source for some Spark jobs, is it advisable to install Spark on the same nodes as the Kafka cluster? What are the benefits and drawbacks of such a decision? Regards

RE: Apache Spark data locality when integrating with Kafka

2016-02-06 Thread Diwakar Dhanuskodi
Yes, to reduce network latency. Sent from Samsung Mobile. Original message From: fanooos Date: 07/02/2016 09:24 (GMT+05:30) To: user@spark.apache.org Cc: Subject: Apache Spark data locality when integrating with Kafka Dears, if I use

Re: Apache Spark data locality when integrating with Kafka

2016-02-06 Thread Koert Kuipers
Spark can benefit from data locality and will try to launch tasks on the node where the Kafka partition resides. However, I think in production many organizations run a dedicated Kafka cluster. On Sat, Feb 6, 2016 at 11:27 PM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Yes, to

Imported CSV file content isn't identical to the original file

2016-02-06 Thread SLiZn Liu
Hi Spark Users Group, I have a CSV file to analyze with Spark, but I'm having trouble importing it as a DataFrame. Here's a minimal reproducible example. Suppose I have a *10(rows)x2(cols)* *space-delimited csv* file, shown below: 1446566430 2015-11-0400:00:30 1446566430 2015-11-0400:00:30
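
One common way to load a non-comma-delimited file in the Spark 1.x era is the spark-csv package with an explicit delimiter and schema; a hedged sketch (the path and column names are assumptions, and it presumes com.databricks:spark-csv is on the classpath):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object SpaceDelimitedCsv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Declaring both columns as strings sidesteps type inference, which is
    // one common reason imported content differs from the original file.
    val schema = StructType(Seq(
      StructField("epoch", StringType, nullable = false),
      StructField("timestamp", StringType, nullable = false)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", " ") // space-delimited rather than comma
      .schema(schema)
      .load("/path/to/file.csv") // placeholder path

    df.show(10, false) // truncate = false, to see full cell contents
    sc.stop()
  }
}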

Re: Bad Digest error while doing aws s3 put

2016-02-06 Thread Dhimant
Hi, I am getting the following error while reading huge data from S3 and, after processing, writing data to S3 again. Did you find any solution for this? 16/02/07 07:41:59 WARN scheduler.TaskSetManager: Lost task 144.2 in stage 3.0 (TID 169, ip-172-31-7-26.us-west-2.compute.internal):

Re: different behavior while using createDataFrame and read.df in SparkR

2016-02-06 Thread Devesh Raj Singh
Thank you, Rui Sun, for the observation! It helped. I have a new problem arising. When I create a small function for dummy-variable creation for a categorical column: BDADummies <- function(dataframe, column) { cat.column <- vector(mode="character", length=nrow(dataframe)) cat.column <- collect(column)

Re: Kafka directsream receiving rate

2016-02-06 Thread Diwakar Dhanuskodi
Cody, yes, I am printing each message. It is processing all messages under each dstream block. Source systems are publishing 1 million messages / 4 secs, which is less than the batch interval. The issue is with DirectStream processing 10 messages per event. When partitions were
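
If only a handful of records arrive per batch despite a much higher publish rate, one thing worth ruling out is the direct stream's rate limiting; a hedged sketch of the relevant settings against the 1.x API (the broker list, topic, and numbers are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectStreamRate {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("direct-stream-rate")
      .setMaster("local[2]")
      // Caps records pulled per partition per second; if set too low it makes
      // every batch look tiny no matter how fast the producer publishes.
      .set("spark.streaming.kafka.maxRatePerPartition", "100000")
      // Alternatively, let Spark adapt the ingest rate to processing speed.
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.count().print() // how many records each batch actually pulled

    ssc.start()
    ssc.awaitTermination()
  }
}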

Re: Shuffle memory woes

2016-02-06 Thread Igor Berman
Hi, usually you can solve this in 2 steps: make the RDD have more partitions, and play with the shuffle memory fraction. In Spark 1.6, cache vs. shuffle memory fractions are adjusted automatically. On 5 February 2016 at 23:07, Corey Nolet wrote: > I just recently had a discovery that my
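
A hedged sketch of those two steps against the pre-1.6 static memory model (the fractions and partition count are illustrative assumptions, not recommendations; in 1.6+, the unified memory manager sizes cache vs. shuffle automatically):

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-tuning")
      .setMaster("local[*]")
      // Pre-1.6 static model: grow shuffle memory at the expense of cache.
      .set("spark.shuffle.memoryFraction", "0.6")
      .set("spark.storage.memoryFraction", "0.2")

    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))

    // More partitions per shuffle => smaller per-task buffers, less spilling.
    val counts = pairs.reduceByKey(_ + _, 400)

    println(counts.count())
    sc.stop()
  }
}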