Maintaining order of pair rdd

2016-07-23 Thread janardhan shetty
I have a (key, value) pair RDD where the value is an array of Ints. I need to maintain the order of the values in order to execute downstream modifications. How do we maintain the order of values? Ex: rdd = (id1,[5,2,3,15], id2,[9,4,2,5]) Follow-up question: how do we compare between one element in rdd
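A minimal sketch of one reading of the question, assuming the goal is to keep each value array's element order intact through transformations (the RDD contents mirror the example above; everything else is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("order-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(("id1", Array(5, 2, 3, 15)), ("id2", Array(9, 4, 2, 5))))

    // mapValues applies a function to each value without repartitioning,
    // so the element order inside each array is untouched:
    val doubled = rdd.mapValues(_.map(_ * 2))

    // If the keys themselves must also come back in a fixed order, sort explicitly:
    val ordered = doubled.sortByKey().collect()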

SaveToCassandra executed when I stop Spark

2016-07-23 Thread Fernando Avalos
Hi Spark guys, I am getting the information from Streaming and I transform the information: questionStream.filter(_.getEntity.isEmpty).mapPartitions[Choice](questionHandler(_)).saveToCassandra("test","question" I am getting the information from Streaming, I do some filtering and I transform to
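A hedged reconstruction of the flow described above, assuming the spark-cassandra-connector streaming API; questionStream, questionHandler, and Choice are the poster's own names. A DStream output action like saveToCassandra only registers work — nothing runs until the StreamingContext is started:

    // Assumes: import com.datastax.spark.connector.streaming._
    questionStream
      .filter(_.getEntity.isEmpty)
      .mapPartitions[Choice](questionHandler(_))
      .saveToCassandra("test", "question")

    ssc.start()            // output actions begin executing here, once per batch
    ssc.awaitTermination() // stopping the context ends (not triggers) the writes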

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-23 Thread Neil Chang
One example for using dapply is to apply linear regression on many small partitions. I think R can do that with parallelism too, but I heard dapply is faster. On Friday, July 22, 2016, Pedro Rodriguez wrote: > I haven't used SparkR/R before, only Scala/Python APIs so I
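For reference, a one-liner from the Scala side (not SparkR) that answers the thread's title question; whether the SparkR 2.0-preview API exposes an equivalent directly is exactly what the thread is asking:

    // Drop to the underlying RDD to count a DataFrame/Dataset's partitions:
    val numPartitions = df.rdd.getNumPartitions  // df is an assumed DataFrame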

Re: Size exceeds Integer.MAX_VALUE

2016-07-23 Thread Andrew Ehrlich
It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which limits the size of the blocks in the file being written to disk to 2GB. If so, the solution is for you to try tuning for smaller tasks. Try increasing the number of
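A minimal sketch of that tuning advice, assuming the 2GB shuffle-block limit from SPARK-6235 is the cause (path and partition count are illustrative):

    // More partitions means smaller blocks per task, keeping each under 2GB:
    val input = sc.textFile("hdfs:///data/training")
    val smallerTasks = input.repartition(1000) // tune upward until blocks fit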

Re: How to generate a sequential key in rdd across executors

2016-07-23 Thread Andrew Ehrlich
It’s hard to do in a distributed system. Maybe try generating a meaningful key using a timestamp + hashed unique key fields in the record? > On Jul 23, 2016, at 7:53 PM, yeshwanth kumar wrote: > > Hi, > > i am doing bulk load to hbase using spark, > in which i need to
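A hedged sketch of that suggestion (the record type and field name are hypothetical):

    // Compose a key from a timestamp plus a hash of unique fields; this gives
    // meaningful, mostly-ordered keys without cross-executor coordination:
    val keyed = records.map { r =>
      (s"${System.currentTimeMillis}-${r.uniqueField.hashCode}", r)
    }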

Size exceeds Integer.MAX_VALUE

2016-07-23 Thread Ascot Moss
Hi, please help! My Spark: 1.6.2, Java: java8_u40. I am running random forest training and I got "Size exceeds Integer.MAX_VALUE". Any idea how to resolve it? (the log) 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 25) java.lang.IllegalArgumentException: Size exceeds

How to generate a sequential key in rdd across executors

2016-07-23 Thread yeshwanth kumar
Hi, I am doing a bulk load to HBase using Spark, in which I need to generate a sequential key for each record; the key should be sequential across all the executors. I tried zipWithIndex, which didn't work because zipWithIndex gives an index per executor, not across all executors. Looking for some
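For what it's worth, the documented behaviour differs from the description above: RDD.zipWithIndex assigns consecutive Long indices across all partitions (it triggers an extra Spark job to count elements per partition first); it is zipWithUniqueId that produces non-consecutive, per-partition-derived ids. A minimal sketch (records is a placeholder RDD):

    // Indices run 0, 1, 2, ... across the whole RDD, not per executor:
    val indexed = records.zipWithIndex().map { case (rec, idx) => (idx, rec) }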

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread VG
Any suggestions / ideas here ? On Sun, Jul 24, 2016 at 12:19 AM, VG wrote: > Sean, > > I did this just to test the model. When I do a split of my data as > training to 80% and test to be 20% > > I get a Root-mean-square error = NaN > > So I am wondering where I might be

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ascot Moss
I tried to add -Xloggc:./jvm_gc.log --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:./jvm_gc.log -XX:+PrintGCDateStamps"; however, I could not find ./jvm_gc.log. How do I resolve the OOM and GC log issues? Regards On Sun, Jul 24, 2016 at 6:37
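A hedged sketch of one likely explanation: with a relative -Xloggc path, the log lands in each executor's working directory on the worker node, not anywhere on the driver. An absolute path that exists on every worker is easier to locate afterwards:

    import org.apache.spark.SparkConf

    // /tmp/executor_gc.log is illustrative; pick a directory that exists on
    // every worker, then look for the file on the worker hosts themselves.
    val conf = new SparkConf().set("spark.executor.extraJavaOptions",
      "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps " +
      "-XX:+PrintGCDateStamps -Xloggc:/tmp/executor_gc.log")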

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ascot Moss
My JDK is Java 1.8 u40 On Sun, Jul 24, 2016 at 3:45 AM, Ted Yu wrote: > Since you specified +PrintGCDetails, you should be able to get some more > detail from the GC log. > > Also, which JDK version are you using ? > > Please use Java 8 where G1GC is more reliable. > > On

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ted Yu
Since you specified +PrintGCDetails, you should be able to get some more detail from the GC log. Also, which JDK version are you using ? Please use Java 8 where G1GC is more reliable. On Sat, Jul 23, 2016 at 10:38 AM, Ascot Moss wrote: > Hi, > > I added the following

Re: spark and plot data

2016-07-23 Thread Gourav Sengupta
And we are all smiling: https://github.com/bokeh/bokeh-scala Something that helped me immensely, particularly the example. https://github.com/bokeh/bokeh-scala/issues/24 Please note that I use Toree as the Jupyter kernel. Regards, Gourav Sengupta On Sat, Jul 23, 2016 at 8:01 PM, Andrew

Re: spark and plot data

2016-07-23 Thread Andrew Ehrlich
@Gourav, did you find any good inline plotting tools when using the Scala kernel? I found one based on highcharts but it was not frictionless the way matplotlib is. > On Jul 23, 2016, at 2:26 AM, Gourav Sengupta > wrote: > > Hi Pedro, > > Toree is Scala kernel for

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Andrew Ehrlich
+1 for the misleading error. Messages about failing to connect often mean that an executor has died. If so, dig into the executor logs and find out why the executor died (out of memory, perhaps). Andrew > On Jul 23, 2016, at 11:39 AM, VG wrote: > > Hi Pedro, > > Based on

Re: How to give name to Spark jobs shown in Spark UI

2016-07-23 Thread Andrew Ehrlich
As far as I know, the best you can do is refer to the Actions by line number. > On Jul 23, 2016, at 8:47 AM, unk1102 wrote: > > Hi I have multiple child spark jobs run at a time. Is there any way to name > these child spark jobs so I can identify slow running ones. For e.

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread VG
Sean, I did this just to test the model. When I do a split of my data as training to 80% and test to be 20% I get a Root-mean-square error = NaN So I am wondering where I might be going wrong Regards, VG On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen wrote: > No, that's

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread Sean Owen
No, that's certainly not to be expected. ALS works by computing a much lower-rank representation of the input. It would not reproduce the input exactly, and you don't want it to -- this would be seriously overfit. This is why in general you don't evaluate a model on the training set. On Sat, Jul

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread VG
Hi Pedro, Based on your suggestion, I deployed this on an AWS node and it worked fine. Thanks for your advice. I am still trying to figure out the issues in the local environment. Anyway, thanks again. -VG On Sat, Jul 23, 2016 at 9:26 PM, Pedro Rodriguez wrote: > Have

Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread VG
I am trying to run ml.ALS to compute some recommendations. Just to test, I am using the same dataset for training (using ALSModel) and for predicting the results based on the model. When I evaluate the result using RegressionEvaluator I get a Root-mean-square error = 1.5544064263236066. I think
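A hedged sketch of the whole evaluation flow on spark.ml (column names are illustrative; ratings is an assumed DataFrame). One common source of the NaN mentioned later in this thread: ALS returns NaN predictions for users or items absent from the training split, and a single NaN makes the RMSE itself NaN, so drop those rows before evaluating:

    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.recommendation.ALS

    val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
    val model = new ALS()
      .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      .fit(training)

    val predictions = model.transform(test).na.drop() // drop NaN predictions
    val rmse = new RegressionEvaluator()
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")
      .evaluate(predictions)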

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ascot Moss
Hi, I added the following parameter: --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 -XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" Still got Java heap space error. Any idea to

SaveToCassandra executed when I stop Spark

2016-07-23 Thread Fernando Avalos
> Hi Spark guys, > > I am getting the information from Streaming and I transform the information: > > questionStream.filter(_.getEntity.isEmpty).mapPartitions[Choice](questionHandler(_)).saveToCassandra("test","question" > > I am getting the information from Streaming, I do some filtering and I

Re: Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Pedro Rodriguez
Hi Jestin, Spark is smart about how it does joins. In this case, if df2 is sufficiently small it will do a broadcast join. Basically, rather than shuffle df1/df2 for a join, it broadcasts df2 to all workers and joins locally. Looks like you may already have known that though based on using the
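A minimal sketch of forcing that behaviour explicitly ("key" is an illustrative column name; df1/df2 are from the original post):

    import org.apache.spark.sql.functions.broadcast

    val counts = df1.groupBy("key").count()
    // Hint that df2 (the small side) should be shipped to every worker,
    // turning the shuffle join into a local map-side join:
    val joined = counts.join(broadcast(df2), "key")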

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Marco Mistroni
Hi VG, I believe the error msg is misleading. I had a similar one with PySpark yesterday after calling a count on a data frame, where the real error was an incorrect user-defined function being applied. Please send me some sample code with a trimmed-down version of the data and I'll see if I can

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Pedro Rodriguez
Have you changed spark-env.sh or spark-defaults.conf from the default? It looks like spark is trying to address local workers based on a network address (eg 192.168……) instead of on localhost (localhost, 127.0.0.1, 0.0.0.0,…). Additionally, that network address doesn’t resolve correctly. You
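A hedged sketch of one workaround for a local setup, pinning the driver to localhost so workers are not addressed via an unresolvable LAN address (setting SPARK_LOCAL_IP in spark-env.sh accomplishes much the same thing):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("local[2]")
      .set("spark.driver.host", "127.0.0.1") // avoid the 192.168.x.x lookup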

How to give name to Spark jobs shown in Spark UI

2016-07-23 Thread unk1102
Hi, I have multiple child Spark jobs running at a time. Is there any way to name these child Spark jobs so I can identify slow-running ones? E.g. xyz_saveAsTextFile(), abc_saveAsTextFile() etc. Please guide. Thanks in advance.
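One option worth a try, not mentioned in the reply above: SparkContext.setJobGroup tags every job subsequently triggered from the same thread with a group id and description that appear in the web UI. A hedged sketch (RDD names and paths are illustrative):

    sc.setJobGroup("xyz-save", "xyz saveAsTextFile")
    xyzRdd.saveAsTextFile("/out/xyz")

    sc.setJobGroup("abc-save", "abc saveAsTextFile")
    abcRdd.saveAsTextFile("/out/abc")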

Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Jestin Ma
Hello, Right now I'm using DataFrames to perform a df1.groupBy(key).count() on one DataFrame and join with another, df2. The first, df1, is very large (many gigabytes) compared to df2 (250 MB). Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each. It is taking me

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread VG
Hi Pedro, Apologies for not adding this earlier. This is running on a local setup configured as follows: JavaSparkContext jsc = new JavaSparkContext("local[2]", "DR"); Any suggestions based on this? The ports are not blocked by a firewall. Regards, On Sat, Jul 23, 2016 at 8:35 PM, Pedro

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Pedro Rodriguez
Make sure that you don’t have ports firewalled. You don’t really give much information to work from, but it looks like the master can’t access the worker nodes for some reason. If you give more information on the cluster, networking, etc, it would help. For example, on AWS you can create a

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-23 Thread Sun Rui
You should use a RowEncoder built from the DataFrame's own schema, for example:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

    val df = spark.read.parquet(fileName)
    implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
    val df1 = df.flatMap { x => List(x) }

> On Jul 23, 2016, at 22:01, Julien Nauroy wrote:

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-23 Thread Julien Nauroy
Thanks for your quick reply. I've tried with this encoder: implicit def RowEncoder: org.apache.spark.sql.Encoder[Row] = org.apache.spark.sql.Encoders.kryo[Row] Using a suggestion from http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-23 Thread Sun Rui
I gave it a try; the schema after flatMap is the same, which is expected. What’s your Row encoder? > On Jul 23, 2016, at 20:36, Julien Nauroy wrote: > > Hi, > > I'm trying to call flatMap on a Dataframe with Spark 2.0 (rc5). > The code is the following: > var data =

Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread VG
Please suggest if I am doing something wrong or an alternative way of doing this. I have an RDD with two values as follows: JavaPairRDD rdd. When I execute rdd.collectAsMap() it always fails with IO exceptions. 16/07/23 19:03:58 ERROR RetryingBlockFetcher: Exception while

spark context stop vs close

2016-07-23 Thread Mail.com
Hi All, Where should we use spark context stop vs close? Should we stop the context first and then close? Are there general guidelines around this? When I stop and later try to close, I get an RPC already closed error. Thanks, Pradeep
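A hedged sketch of the usual pattern: JavaSparkContext implements java.io.Closeable, and its close() simply delegates to stop(), so one call in a finally block is enough — calling the other afterwards is what produces errors like the one above:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
    try {
      // ... run jobs ...
    } finally {
      sc.stop() // call exactly once; close() on JavaSparkContext does the same
    }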

Using flatMap on Dataframes with Spark 2.0

2016-07-23 Thread Julien Nauroy
Hi, I'm trying to call flatMap on a Dataframe with Spark 2.0 (rc5). The code is the following: var data = spark.read.parquet(fileName).flatMap(x => List(x)) Of course it's an overly simplified example, but the result is the same. The dataframe schema goes from this: root |-- field1:

Re: Role-based S3 access outside of EMR

2016-07-23 Thread Steve Loughran
Amazon's EMR S3 client has stronger consistency guarantees than the ASF S3 clients; it uses DynamoDB to do this. There is some work underway to do something similar atop S3A, S3Guard; see https://issues.apache.org/jira/browse/HADOOP-13345. Regarding IAM support in Spark, the latest version of S3A, which
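A heavily hedged sketch of role-based access through S3A (assumes Hadoop 2.8+, where the fs.s3a.aws.credentials.provider property exists; the bucket and path are placeholders):

    // With no keys configured, ask S3A to pull credentials from the EC2
    // instance profile (i.e. the IAM role attached to the machine):
    sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
      "com.amazonaws.auth.InstanceProfileCredentialsProvider")

    val data = sc.textFile("s3a://my-bucket/path/to/data")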

Re: spark and plot data

2016-07-23 Thread andy petrella
Heya, Might be worth checking the spark-notebook I guess, it offers custom and reactive dynamic charts (scatter, line, bar, pie, graph, radar, parallel, pivot, …) for any kind of data from an intuitive and easy Scala API (with server side, incl. spark based, sampling

Re: spark and plot data

2016-07-23 Thread Gourav Sengupta
Hi Pedro, Toree is a Scala kernel for Jupyter, in case anyone needs a short intro. I use it regularly (when I am not using IntelliJ) and it's quite good. Regards, Gourav On Fri, Jul 22, 2016 at 11:15 PM, Pedro Rodriguez wrote: > As of the most recent 0.6.0 release its

Re: spark and plot data

2016-07-23 Thread Gourav Sengupta
Hi Taotao, that is the way it's usually used to visualize data from Spark. But I do see people transferring the data to a list to feed Matplotlib (as in the Spark course currently running on edX). Please try using Blaze and Bokeh and you will be in a new world altogether. Regards, Gourav On