Futures timed out after [120 seconds]

2016-02-08 Thread Andrew Milkowski
Hello, I have a question: we are seeing the exceptions below, and at the moment are enabling a JVM profiler to look into GC activity on the workers. If you have any other suggestions, please let us know. We don't just want to increase the RPC timeout (from 120 to, say, 600 seconds) but want to get to the reason why the workers time out.
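For reference, a minimal sketch of the timeout settings behind that 120-second default (Spark 1.6-era names; raising them treats the symptom rather than the GC cause, as the poster notes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Umbrella network timeout; most RPC timeouts fall back to it.
      .set("spark.network.timeout", "600s")
      // Explicit RPC ask timeout; defaults to spark.network.timeout if unset.
      .set("spark.rpc.askTimeout", "600s")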

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-08 Thread Shipper, Jay [USA]
I looked back into this today. I made some changes last week to the application to allow for not only compatibility with Spark 1.5.2, but also backwards compatibility with Spark 1.4.1 (the version our current deployment uses). The changes mostly involved changing dependencies from compile to

Access batch statistics in Spark Streaming

2016-02-08 Thread Chen Song
Apologies in advance if someone has already asked and addressed this question. In Spark Streaming, how can I programmatically get batch statistics like scheduling delay, total delay and processing time (they are shown in the streaming tab of the job UI)? I need such information to raise alerts in some

Re: Spark Streaming with Druid?

2016-02-08 Thread Umesh Kacha
Hi Hemant, thanks very much. Can we use SnappyData on YARN? My Spark jobs run using yarn-client mode. Please advise. On Mon, Feb 8, 2016 at 9:46 AM, Hemant Bhanawat wrote: > You may want to have a look at spark druid project already in progress: >

Re: Shuffle memory woes

2016-02-08 Thread Igor Berman
It will be interesting to see what the Spark dev people say. Corey, do you have your presentation available online? On 8 February 2016 at 05:16, Corey Nolet wrote: > Charles, > > Thank you for chiming in and I'm glad someone else is experiencing this > too and not just me. I know very

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
Thanks Luciano, now it looks like I'm the only one who has this issue. My options are narrowed down to upgrading my Spark to 1.6.0, to see if this issue is gone. — Cheers, Todd Leo On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende wrote: > I tried in both 1.5.0, 1.6.0 and

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
I’ve found the trigger of my issue: if I start my spark-shell or submit via spark-submit with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame content goes wrong, as I described earlier. On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu wrote: >

How to see Cassandra List / Set / Map values from Spark Hive Thrift JDBC?

2016-02-08 Thread Matthew Johnson
Hi all, I have asked this question on StackOverflow: http://stackoverflow.com/questions/35222365/spark-sql-hivethriftserver2-get-liststring-from-cassandra-in-squirrelsql but I am hoping to have more luck with this group. When I write a Java SparkSQL application to query a

Extract all the values from describe

2016-02-08 Thread Arunkumar Pillai
Hi, I have a DataFrame df and I use df.describe() to get the stats values, but I am not able to parse and extract all the individual information. Please help. -- Thanks and Regards Arun

Re: how can i write map(x => x._1 ++ x._2) statement in python.??

2016-02-08 Thread Yuval.Itzchakov
In Python, concatenating two lists can be done simply using the + operator. I'm assuming the RDD you're mapping over consists of tuples: map(lambda x: x[0] + x[1])
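A minimal sketch of the two versions side by side, assuming a spark-shell session where sc is available and each element is a pair of sequences:

    // Scala: ++ concatenates the two halves of each tuple.
    val rdd = sc.parallelize(Seq((Seq(1, 2), Seq(3)), (Seq(4), Seq(5, 6))))
    val joined = rdd.map(x => x._1 ++ x._2)
    joined.collect()  // Array(List(1, 2, 3), List(4, 5, 6))
    // Python equivalent: rdd.map(lambda x: x[0] + x[1])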

Re: Kafka directsream receiving rate

2016-02-08 Thread Diwakar Dhanuskodi
Now, using DirectStream, I am able to process 2 million messages from a 20-partition topic in a batch interval of 2000 ms. I finally figured out that the Kafka producer from a source system was sending the topic name instead of the key in KeyedMessage. It could put messages

[Spark 1.5.1] percentile in spark

2016-02-08 Thread Arunkumar Pillai
Hi, I'm using a SQL query to find the percentile value. Are there any predefined functions for percentile calculation? -- Thanks and Regards Arun
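One route, assuming sqlContext is a HiveContext (Spark 1.5 has no built-in DataFrame percentile function, but the Hive UDAFs are reachable through SQL; the table and column names below are hypothetical):

    // percentile_approx is the Hive UDAF; 0.5 asks for the median.
    val median = sqlContext.sql(
      "SELECT percentile_approx(value, 0.5) AS median FROM my_table")
    median.show()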

Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
Hi All, A long-running Spark job on YARN throws the exception below after running for a few days. yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row. org.apache.hadoop.yarn.exceptions.YarnException: *No AMRMToken found* for user prabhu at

Re: Spark Streaming with Druid?

2016-02-08 Thread Hemant Bhanawat
SnappyData's deployment is different from how Spark is deployed. See http://snappydatainc.github.io/snappydata/deployment/ and http://snappydatainc.github.io/snappydata/jobs/. For further questions, you can join us on stackoverflow http://stackoverflow.com/questions/tagged/snappydata. Hemant

Re: Optimal way to re-partition from a single partition

2016-02-08 Thread Takeshi Yamamuro
Hi, Please use DataFrame#repartition. On Tue, Feb 9, 2016 at 7:30 AM, Cesar Flores wrote: > > I have a data frame which I sort using orderBy function. This operation > causes my data frame to go to a single partition. After using those > results, I would like to re-partition to
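A minimal sketch of that suggestion, with a hypothetical sort key; note that repartition shuffles, so the global sort order is not preserved across the new partitions:

    val sorted = df.orderBy("key")  // ends up in a single partition in the reported case
    // ... consume the sorted results ...
    val repartitioned = sorted.repartition(100)  // DataFrame#repartition, as suggested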

Re: Please help with external package using --packages option in spark-shell

2016-02-08 Thread Jeff - Data Bean Australia
Finally I figured out the problem and fixed it. There was some inconsistency between my .ivy2 and .m2 repositories. Spark resolves the dependencies using metadata in ivy2/cache but does not verify the artifact's real location. That was why Spark resolved jackson-core-asl in local-m2-cache. But when Spark tried to

Re: Bad Digest error while doing aws s3 put

2016-02-08 Thread lmk
Hi Dhimant, As I had indicated in my next mail, my problem was due to the disk getting full with log messages (these were dumped onto the slaves) and did not have anything to do with the content pushed into s3. So it looks like this error message is very generic and is thrown for various reasons. You

RE: different behavior while using createDataFrame and read.df in SparkR

2016-02-08 Thread Sun, Rui
I guess the problem is: dummy.df <- withColumn(dataframe, paste0(colnames(cat.column), j), ifelse(column[[1]] == levels(as.factor(unlist(cat.column)))[j], 1, 0)); dataframe <- dummy.df Once dataframe is re-assigned to reference a new DataFrame in each iteration, the column variable has to be

Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
+ Spark-Dev On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph wrote: > Hi All, > > A long running Spark job on YARN throws below exception after running > for few days. > > yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row. >

Re: ALS rating caching

2016-02-08 Thread Nick Pentreath
In the "new" ALS intermediate RDDs (including the ratings input RDD after transforming to block-partitioned ratings) is cached using intermediateRDDStorageLevel, and you can select the final RDD storage level (for user and item factors) using finalRDDStorageLevel. The old MLLIB API now calls the

[Spark Streaming] Spark Streaming dropping last lines

2016-02-08 Thread Nipun Arora
I have a spark-streaming service, where I am processing and detecting anomalies on the basis of an offline-generated model. I feed data into this service from a log file, which is streamed using the following command: tail -f <logfile> | nc -lk <port> Here the spark streaming service is taking data from
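A sketch of the receiving side under that setup, assuming a spark-shell session where sc is available (host, port, and batch interval are hypothetical):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    // Reads the lines that nc serves on the given port.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()
    ssc.start()
    ssc.awaitTermination()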

Re: Extract all the values from describe

2016-02-08 Thread James Barney
Hi Arunkumar, From the Scala documentation it's recommended to use the agg function for performing any actual statistics programmatically on your data. df.describe() is meant only for data exploration. See Aggregator here:
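A sketch of that agg route, with a hypothetical Double column named value; unlike describe(), which returns a string-typed DataFrame meant for display, this yields typed values directly:

    import org.apache.spark.sql.functions.{avg, max, min}

    val row = df.agg(min("value"), max("value"), avg("value")).first()
    // min/max keep the column's type; avg returns a Double.
    val (mn, mx, mean) = (row.getDouble(0), row.getDouble(1), row.getDouble(2))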

Re: Shuffle memory woes

2016-02-08 Thread Corey Nolet
I sure do! [1] And yes- I'm really hoping they will chime in, otherwise I may dig a little deeper myself and start posting some jira tickets. [1] http://www.slideshare.net/cjnolet On Mon, Feb 8, 2016 at 3:02 AM, Igor Berman wrote: > It's interesting to see what spark dev

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread Luciano Resende
Sorry, same expected results with trunk and Kryo serializer On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu wrote: > I’ve found the trigger of my issue: if I start my spark-shell or submit > by spark-submit with --conf >

Dynamically Change Log Level Spark Streaming

2016-02-08 Thread Ashish Soni
Hi All, How do I change the log level for a running Spark Streaming job? Any help will be appreciated. Thanks,
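One approach, sketched under the assumption that plain log4j is in use and dstream stands for any DStream in the job: set the root logger level on the driver, and inside a partition-level operation so it also takes effect on the executors (there is no official live-reconfiguration API here):

    import org.apache.log4j.{Level, LogManager}

    // Driver side.
    LogManager.getRootLogger.setLevel(Level.WARN)

    // Executor side: runs once per partition of each batch.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { _ =>
        LogManager.getRootLogger.setLevel(Level.WARN)
      }
    }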

Example of onEnvironmentUpdate Listener

2016-02-08 Thread Ashish Soni
Are there any examples of how to implement the onEnvironmentUpdate method for a custom listener? Thanks,
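A minimal sketch of such a listener; the event carries the environment details as a map of property lists:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerEnvironmentUpdate}

    class EnvUpdateListener extends SparkListener {
      override def onEnvironmentUpdate(event: SparkListenerEnvironmentUpdate): Unit = {
        // "Spark Properties" is one of the sections shown on the Environment tab.
        event.environmentDetails.getOrElse("Spark Properties", Seq.empty)
          .foreach { case (k, v) => println(s"$k = $v") }
      }
    }

    sc.addSparkListener(new EnvUpdateListener)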

ErrorToken illegal character in a query having / @ $ . symbols

2016-02-08 Thread Mohamed Nadjib MAMI
Hello all, Could someone please help me figure out what is wrong with my query, which I'm running over Parquet tables? The query has the following form: weird_query = "SELECT a._example.com/aa/1.1/aa_, b._example.com/bb/1.2/bb_ FROM www$aa@aa a LEFT JOIN www$bb@bb b ON
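One thing worth trying, sketched in Scala (the original appears to be Python): Spark SQL accepts otherwise-illegal characters in identifiers when they are quoted with backticks. The join condition below is hypothetical since the original is cut off, and whether the metastore accepts such table names is a separate question:

    // Backticks quote identifiers containing . / @ $ characters.
    val weirdQuery = """SELECT a.`_example.com/aa/1.1/aa_`, b.`_example.com/bb/1.2/bb_`
      FROM `www$aa@aa` a LEFT JOIN `www$bb@bb` b ON a.id = b.id"""
    sqlContext.sql(weirdQuery)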

Re: Access batch statistics in Spark Streaming

2016-02-08 Thread Bryan Jeffrey
From within a Spark job you can use a periodic listener: ssc.addStreamingListener(PeriodicStatisticsListener(Seconds(60))) class PeriodicStatisticsListener(timePeriod: Duration) extends StreamingListener { private val logger = LoggerFactory.getLogger("Application") override def
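A completed sketch of the listener quoted above, written as a case class so the new-less call shape PeriodicStatisticsListener(Seconds(60)) compiles; the per-period aggregation implied by timePeriod is omitted. The delays are in milliseconds and match the streaming tab:

    import org.apache.spark.streaming.Duration
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
    import org.slf4j.LoggerFactory

    case class PeriodicStatisticsListener(timePeriod: Duration) extends StreamingListener {
      private val logger = LoggerFactory.getLogger("Application")

      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        logger.info(s"scheduling delay: ${info.schedulingDelay.getOrElse(-1L)} ms, " +
          s"processing delay: ${info.processingDelay.getOrElse(-1L)} ms, " +
          s"total delay: ${info.totalDelay.getOrElse(-1L)} ms")
      }
    }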

Spark LBFGS Error with ANN

2016-02-08 Thread Hayri Volkan Agun
I am using the Multilayer Perceptron Classifier. In each training instance there are multiple 1.0 values in the output vector of the Multilayer Perceptron Classifier. This is necessary. With a small amount of training data I am getting the following error *ERROR LBFGS: Failure again! Giving up and returning.

Re: Bad Digest error while doing aws s3 put

2016-02-08 Thread Eugen Cepoi
I had similar problems with multi part uploads. In my case the real error was something else which was being masked by this issue https://issues.apache.org/jira/browse/SPARK-6560. In the end this bad digest exception was a side effect and not the original issue. For me it was some library version

Spark in Production - Use Cases

2016-02-08 Thread Scott walent
Spark Summit East is just 10 days away and we are almost sold out! One of the highlights this year will focus on how Spark is being used across businesses to solve both big and small data needs. Check out the full agenda here: https://spark-summit.org/east-2016/schedule/ Use "ApacheList" for 30%

Optimal way to re-partition from a single partition

2016-02-08 Thread Cesar Flores
I have a data frame which I sort using orderBy function. This operation causes my data frame to go to a single partition. After using those results, I would like to re-partition to a larger number of partitions. Currently I am just doing: val rdd = df.rdd.coalesce(100, true) //df is a dataframe

LogisticRegressionModel not able to load serialized model from S3

2016-02-08 Thread Utkarsh Sengar
I am storing a model in s3 in this path: "bucket_name/p1/models/lr/20160204_0410PM/ser" and the structure of the saved dir looks like this: 1. bucket_name/p1/models/lr/20160204_0410PM/ser/data -> _SUCCESS, _metadata, _common_metadata and
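For reference, a sketch of the load call in question; the path mirrors the thread, while the s3n scheme and configured credentials are assumptions:

    import org.apache.spark.mllib.classification.LogisticRegressionModel

    // load(sc, path) reads back the data/ and metadata/ directories written by save().
    val model = LogisticRegressionModel.load(
      sc, "s3n://bucket_name/p1/models/lr/20160204_0410PM/ser")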

ALS rating caching

2016-02-08 Thread Roberto Pagliari
When using ALS from mllib, would it be better/recommended to cache the ratings RDD? I'm asking because when predicting products for users (for example) it is recommended to cache product/user matrices. Thank you,

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
At least it works for me, though; I temporarily disabled the Kryo serializer until I upgrade to 1.6.0. Thanks for your update. :) Luciano Resende wrote on Tue, Feb 9, 2016 at 02:37: > Sorry, same expected results with trunk and Kryo serializer > > On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu