Messy GraphX merge/reduce functions

2014-02-26 Thread Dan Davies
I apologize in advance if the answer to this question is obvious, but here goes ... The GraphX merge and reduction functions take two objects of type A and return one object of type A. Is there a nice, efficient way to implement a reduction function like median or mode? These reduction functions
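Median and mode are not associative, so they cannot be expressed directly as a pairwise reduce; a common workaround is to merge the full collections of values during the reduce step and compute the statistic in a final pass. A minimal sketch of that idea in plain Scala (not tied to any particular GraphX API):

    // merge step: concatenate the accumulated values (this part is associative)
    def mergeValues(a: Vector[Double], b: Vector[Double]): Vector[Double] = a ++ b

    // final step: compute the median once all values for a vertex have been merged
    def median(values: Vector[Double]): Double = {
      val sorted = values.sorted
      val n = sorted.length
      if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
    }

    // e.g. median(mergeValues(Vector(1.0, 5.0), Vector(3.0))) == 3.0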

Re: spark/shark + cql3

2014-02-26 Thread Rohit Rai
The problem here will be the difference in storage and structure of data across various systems. I think saveAsNewAPIHadoopFile in PairRDDFunctions provides a decent abstraction for writing to any Hadoop-supported output. A way to build a generic API to persist to different storage would be to create seriali
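For reference, a minimal Scala sketch of writing a pair RDD through saveAsNewAPIHadoopFile (the path and data are illustrative; any Hadoop OutputFormat can be plugged in the same way):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

    // assumes an existing SparkContext `sc`
    val counts = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }

    counts.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/counts",                              // illustrative output path
      classOf[Text],
      classOf[IntWritable],
      classOf[SequenceFileOutputFormat[Text, IntWritable]])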

Re: [incubating-0.9.0] Too Many Open Files on Workers

2014-02-26 Thread Rohit Rai
Hello Andy, This is a problem we have seen when using the CQL Java driver under heavy read loads, where it is using NIO and is waiting on many pending responses, which causes too many open sockets and hence too many open files. Are you by any chance using async queries? I am the maintainer of Calliop

Re: JVM error

2014-02-26 Thread Aaron Davidson
Setting spark.executor.memory is indeed the correct way to do this. If you want to configure this in spark-env.sh, you can use export SPARK_JAVA_OPTS=" -Dspark.executor.memory=20g" (make sure to append the variable if you've been using SPARK_JAVA_OPTS previously) On Wed, Feb 26, 2014 at 7:50 PM,
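A minimal spark-env.sh sketch of appending rather than overwriting (the memory value is illustrative):

    # spark-env.sh: append to any existing SPARK_JAVA_OPTS instead of replacing it
    export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.executor.memory=20g"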

Re: JVM error

2014-02-26 Thread Bryn Keller
Hi Mohit, You can still set SPARK_MEM in spark-env.sh, but that is deprecated. This is from SparkContext.scala: if (!conf.contains("spark.executor.memory") && sys.env.contains("SPARK_MEM")) { logWarning("Using SPARK_MEM to set amount of memory to use per executor process is " + "depreca

Re: Build Spark in IntelliJ IDEA 13

2014-02-26 Thread moxiecui
Hi Chen: Check File > Settings > Compiler > Java Compiler and see if there are some compiler settings to change. Hope that will help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-in-IntelliJ-IDEA-13-tp2081p2103.html Sent from the Apache

Re: JVM error

2014-02-26 Thread Mohit Singh
Hi Bryn, Thanks for responding. Is there a way I can permanently configure this setting, like SPARK_EXECUTOR_MEMORY or something like that? On Wed, Feb 26, 2014 at 2:56 PM, Bryn Keller wrote: > Hi Mohit, > > Try increasing the *executor* memory instead of the worker memory - the > most appro

Re: skipping ahead in RDD

2014-02-26 Thread Tathagata Das
If you are doing a computation where the result at time T depends on all the previous data till T, then Spark Streaming will automatically ask you to checkpoint the RDDs generated through Spark Streaming periodically. Checkpointing means saving the RDD to HDFS (or HDFS compatible system). Say the c
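A minimal Scala sketch of enabling checkpointing in Spark Streaming (the host, port, directory and intervals are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // assumes an existing SparkContext `sc`
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints")               // HDFS (or HDFS-compatible) directory

    val counts = ssc.socketTextStream("host", 9999)
      .map(line => (line, 1L))
      .updateStateByKey[Long]((vals, state) => Some(state.getOrElse(0L) + vals.sum))

    counts.checkpoint(Seconds(60))                      // checkpoint the stateful stream periodically
    ssc.start()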

Re: skipping ahead in RDD

2014-02-26 Thread Mayur Rustagi
You can checkpoint & it'll restrict the lineage to only the updates after the checkpoint. Regards Mayur Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Wed, Feb 26, 2014 at 1:23 PM, Adrian Mocanu wrote: > Hi > > S
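A short Scala sketch of checkpointing at the RDD level to cut the lineage (the directory and data are illustrative):

    // assumes an existing SparkContext `sc`
    sc.setCheckpointDir("hdfs:///checkpoints")
    val updated = sc.parallelize(1 to 100).map(_ + 1)   // stands in for the accumulated state
    updated.checkpoint()                                // lineage is truncated once the checkpoint is written
    updated.count()                                     // an action forces materialization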

worker keeps getting disassociated upon a failed job spark version 0.90

2014-02-26 Thread Shirish
I am a newbie!! I am running Spark 0.90 in standalone mode on my Mac. The master and worker run on the same machine. Both of them start up fine (at least that is what I see in the log). *Upon start-up master log is:* 14/02/26 15:38:08 INFO Slf4jLogger: Slf4jLogger started 14/02/26 15:38:08 IN

Re: JVM error

2014-02-26 Thread Bryn Keller
Hi Mohit, Try increasing the *executor* memory instead of the worker memory - the most appropriate place to do this is actually when you're creating your SparkContext, something like: conf = pyspark.SparkConf() .setMaster("spark://master:7077") .setAp
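For reference, a Scala sketch of the same configuration (the master URL, app name and memory value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("my-app")
      .set("spark.executor.memory", "20g")
    val sc = new SparkContext(conf)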

Re: Build Spark in IntelliJ IDEA 13

2014-02-26 Thread Bryn Keller
Hi Yanzhe, With Intellij 13, I don't think you need to use gen-idea, it should be able to import the sbt project directly: http://blog.jetbrains.com/scala/2013/11/18/built-in-sbt-support-in-intellij-idea-13/#comment-2742 Hope that helps, Bryn On Wed, Feb 26, 2014 at 8:59 AM, Yanzhe Chen wrote

JVM error

2014-02-26 Thread Mohit Singh
Hi, I am experimenting with pyspark lately... Every now and then, I see this error being streamed to the pyspark shell... and most of the time the computation/operation completes... and sometimes it just gets stuck... My setup is an 8-node cluster... with loads of RAM (256GB's) and space (TB's) per nod

skipping ahead in RDD

2014-02-26 Thread Adrian Mocanu
Hi Scenario: Say I've been streaming tuples with Spark for 24 hours and one of the nodes fails. The RDD will be recomputed on the other Spark nodes and the streaming continues. I'm interested to know how I can skip the first 23 hours and jump in the stream to the last hour. Is this possible?

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Bryn Keller
In the past I've handled this by filtering out the header line, but it seems to me that it would be useful to have a way of dealing with files that would preserve sequence, so that e.g. you could just do mySequentialRDD.drop(1) to get rid of the header. There are other use cases like this that curr
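In the absence of a drop(1), one sequence-preserving trick is to drop the first line of only the first partition; a Scala sketch of the idea (the PySpark API is analogous, and the path is illustrative):

    // assumes an existing SparkContext `sc`
    val lines = sc.textFile("hdfs:///data/input.csv")
    val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter              // the header line sits in the first partition
    }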

Actors and sparkcontext actions

2014-02-26 Thread Ognen Duzlevski
Can someone point me to a simple, short code example of creating a basic Actor that gets a context and runs an operation such as .textFile.count? I am trying to figure out how to create just a basic actor that gets a message like this: case class Msg(filename:String, ctx: SparkContext) and th
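A minimal sketch of such an actor in Scala with Akka (the actor, system and path names are illustrative):

    import akka.actor.{Actor, ActorSystem, Props}
    import org.apache.spark.SparkContext

    case class Msg(filename: String, ctx: SparkContext)

    class CountActor extends Actor {
      def receive = {
        case Msg(filename, ctx) =>
          val n = ctx.textFile(filename).count()        // run the Spark job from inside the actor
          sender ! n                                    // reply with the line count
      }
    }

    // usage: assumes an existing SparkContext `sc`
    val system = ActorSystem("demo")
    val counter = system.actorOf(Props[CountActor], "counter")
    counter ! Msg("hdfs:///data/input.txt", sc)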

Re: window every n elements instead of time based

2014-02-26 Thread Tathagata Das
Currently, all built-in DStream window operations are time-based. We may provide count-based windowing in the future. On Wed, Feb 26, 2014 at 9:34 AM, Adrian Mocanu wrote: > Hi > > Is there a way to do window processing but not based on time but every 6 > items going through the stream? > > >

Re: specify output format using pyspark

2014-02-26 Thread Chengi Liu
Cool. Thanks On Wed, Feb 26, 2014 at 9:48 AM, Ewen Cheslack-Postava wrote: > You need to convert it to the format you want yourself. The output you're > seeing is just the automatic conversion of your data by unicode(). > > -Ewen > > Chengi Liu > February 26, 2014 at 9:43 AM > Hi, > How do

specify output format using pyspark

2014-02-26 Thread Chengi Liu
Hi, How do we save data to hdfs using pyspark in the "right" format? I use: counts = counts.saveAsTextFile("hdfs://localhost:1234//foo") But when I look into the data... it is always in tuple format: (1245,23) (1235,99) How do I specify the output format in pyspark? Thanks

Re: specify output format using pyspark

2014-02-26 Thread Ewen Cheslack-Postava
You need to convert it to the format you want yourself. The output you're seeing is just the automatic conversion of your data by unicode(). -Ewen Chengi Liu February 26, 2014 at 9:43 AM Hi, How do we save data to hdfs using pyspark in "right" format. I use: counts = cou
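In other words, map each record to the string you want before saving; a Scala sketch of the idea (the PySpark version is analogous, formatting in a lambda before saveAsTextFile; path and separator are illustrative):

    // assumes an existing SparkContext `sc`
    val counts = sc.parallelize(Seq((1245, 23), (1235, 99)))
    counts.map { case (k, v) => s"$k\t$v" }             // format each pair explicitly
      .saveAsTextFile("hdfs://localhost:1234/foo")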

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Ewen Cheslack-Postava
You must be parsing each line of the file at some point anyway, so adding a step to filter out the header should work fine. It'll get executed at the same time as your parsing/conversion to ints, so there's no significant overhead aside from the check itself. For standalone programs, there's a
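A short Scala sketch of filtering the header as part of the same pipeline (column names follow the example from the original question; the path is illustrative):

    // assumes an existing SparkContext `sc`
    val data = sc.textFile("hdfs:///data/input.csv")
    val counts = data
      .filter(line => !line.startsWith("id,"))          // drop the header row
      .map { line => val cols = line.split(","); (cols(0), cols(1).trim.toInt) }
      .reduceByKey(_ + _)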

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Chengi Liu
I am not sure... the suggestion is to open a TB file and remove a line? That doesn't sound that good. I am hacking my way around it by using a filter. Can I put a try/except clause in my lambda function? Maybe I should just try that out. But thanks for the suggestion. Also, can I run scripts against spark

window every n elements instead of time based

2014-02-26 Thread Adrian Mocanu
Hi Is there a way to do window processing not based on time but on every 6 items going through the stream? Example: Window of size 3 with 1 item "duration" Stream data: 1,2,3,4,5,6,7 [1,2,3]=window 1 [2,3,4]=window 2 [3,4,5]=window 3 etc -Adrian

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Mayur Rustagi
A bad solution is to run a mapper through the data and null the counts; a good solution is to trim the header beforehand without Spark. On Feb 26, 2014 9:28 AM, "Chengi Liu" wrote: > Hi, > How do we deal with headers in csv file. > For example: > id, counts > 1,2 > 1,5 > 2,20 > 2,25 > ... and so

Dealing with headers in csv file pyspark

2014-02-26 Thread Chengi Liu
Hi, How do we deal with headers in a csv file? For example: id, counts 1,2 1,5 2,20 2,25 ... and so on. And I want to do a frequency count of counts for each id. So the result will be: 1,7 2,45 and so on... My code: counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b) But

Re: Build Spark in IntelliJ IDEA 13

2014-02-26 Thread Sean Owen
I also use IntelliJ 13 on a Mac, with only Java 7, and have never seen this. If you look at the Spark build, you will see that it specifies Java 6, not 7. Even if you changed java.version in the build, you would not get this error, since it specifies source and target to be the same value. In fact

Build Spark in IntelliJ IDEA 13

2014-02-26 Thread Yanzhe Chen
Hi, all I'm trying to build Spark in IntelliJ IDEA 13. I clone the latest repo and run sbt/sbt gen-idea in the root folder. Then import it into IntelliJ IDEA. Scala plugin for IntelliJ IDEA has been installed. Everything seems ok until I ran Build > Make Project: Information: Using javac 1.7.0_

Re: ReduceByKey or groupByKey to Count?

2014-02-26 Thread Mayur Rustagi
JavaPairRDD exposes the same save functions as JavaRDD, so you can call them directly. Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Wed, Feb 26, 2014 at 7:20 AM, dmpour23 wrote: > If i use groupbyKey as so... > > Java
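A Scala sketch of the flow (the Java API mirrors it); the grouped pair RDD can be saved directly, and the path is illustrative:

    // assumes an existing SparkContext `sc`
    val ones = sc.parallelize(Seq(("k1", "a"), ("k1", "b"), ("k2", "c")))
    val twos = ones.groupByKey(3).cache()
    twos.saveAsTextFile("hdfs:///tmp/grouped")          // no conversion to a plain RDD needed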

Re: ReduceByKey or groupByKey to Count?

2014-02-26 Thread dmpour23
If I use groupByKey as so... JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache(); How would I write the contents of the List of Strings to a file or to Hadoop? Do I need to transform the JavaPairRDD to a JavaRDD and call saveAsTextFile? -- View this message in context: http://apache-spark-user-li

Kryo Registration, class is not registered, but Log.TRACE() says otherwise

2014-02-26 Thread pondwater
Hello, I'm trying to use kryo serialization with the required registration flag set to true via kryo.setRegistrationRequired(true). I keep getting the following error saying that a certain class is not registered: java.lang.IllegalArgumentException: Class is not registered: com.my.package.MyClas
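For reference, a minimal Scala sketch of registering classes with Kryo through a custom registrator (MyClass here is a placeholder standing in for the unregistered class from the error message):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyClass(id: Int)                         // placeholder for com.my.package.MyClass

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.setRegistrationRequired(true)
        kryo.register(classOf[MyClass])                 // every class Kryo will serialize must be registered
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)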

Re: HBase row count

2014-02-26 Thread Nick Pentreath
Currently, no, there is no way to save the web UI details. There was some discussion around adding this on the mailing list, but no change as yet — Sent from Mailbox for iPhone On Tue, Feb 25, 2014 at 7:23 PM, Soumitra Kumar wrote: > Found the issue, actually splits in HBase was not uniform, so on

Re: Spark in YARN HDP problem

2014-02-26 Thread aecc
Hi Azuryy, This is what I got: LogType: stderr LogLength: 85 Log Contents: Error: Could not find or load main class org.apache.spark.deploy.yarn.WorkerLauncher LogType: stdout LogLength: 0 Log Contents: That made me think about the consolidated jar file that I was including. It was a simple

Re: Spark in YARN HDP problem

2014-02-26 Thread Azuryy Yu
Can you check the container logs and paste them here? On Feb 26, 2014 7:07 PM, "aecc" wrote: > The error happens after the application is accepted: > > 14/02/26 12:05:08 INFO YarnClientImpl: Submitted application > application_1390483691679_0157 to ResourceManager at X:8050 > 3483 [main] INFO

Re: Spark in YARN HDP problem

2014-02-26 Thread aecc
The error happens after the application is accepted: 14/02/26 12:05:08 INFO YarnClientImpl: Submitted application application_1390483691679_0157 to ResourceManager at X:8050 3483 [main] INFO org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application report from ASM:

Re: Implementing a custom Spark shell

2014-02-26 Thread Matei Zaharia
In Spark 0.9 and master, you can pass the -i argument to spark-shell to load a script containing commands before opening the prompt. This is also a feature of the Scala shell as a whole (try scala -help for details). Also, once you’re in the shell, you can use :load file.scala to execute the co
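For example (the script name is illustrative):

    # preload a script before the prompt opens
    bin/spark-shell -i init.scala

    # or, from inside a running shell
    :load init.scala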