Re: Research ideas using spark

2015-07-15 Thread Robin East
Well said, Will. I would add that you might want to investigate GraphChi, which claims to run a number of large-scale graph processing tasks on a single workstation much quicker than a very large Hadoop cluster. It would be interesting to know how widely applicable the approach GraphChi takes

Spark Accumulator Issue - java.io.IOException: java.lang.StackOverflowError

2015-07-15 Thread Jadhav Shweta
Hi, I am trying a transformation by calling a Scala method. This Scala method returns a MutableList[AvroObject]: def processRecords(id: String, list1: Iterable[(String, GenericRecord)]): scala.collection.mutable.MutableList[AvroObject]. Hence, the output of the transformation is
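A minimal sketch of how such a method is typically wired into a transformation (a groupByKey followed by flatMap); AvroObject and its fields below are stand-ins for illustration, not the poster's actual types:

    import org.apache.avro.generic.GenericRecord
    import scala.collection.mutable.MutableList

    // Hypothetical stand-in for the thread's AvroObject.
    case class AvroObject(id: String, payload: String)

    // Builds a mutable list of output objects for one key group.
    def processRecords(id: String,
                       list1: Iterable[(String, GenericRecord)]): MutableList[AvroObject] = {
      val out = MutableList[AvroObject]()
      list1.foreach { case (_, rec) => out += AvroObject(id, rec.toString) }
      out
    }

    // Typical call site: flatMap over each group so the result is an
    // RDD[AvroObject] rather than an RDD of lists.
    // val result = pairs.groupByKey().flatMap { case (id, recs) => processRecords(id, recs) }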

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
I personally would try to avoid updateStateByKey for sessionization when you have long sessions / a lot of keys, because it's linear on the number of keys. On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das t...@databricks.com wrote: [Apologies for repost, for those who have seen this response
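For context, a minimal updateStateByKey sessionization sketch (the Session type and its fields are assumed for illustration). The update function is invoked for every tracked key on every batch, which is why the cost grows with the number of outstanding keys rather than with batch size:

    // Hypothetical session state: event count and last-seen timestamp.
    case class Session(count: Int, lastSeen: Long)

    def updateSession(events: Seq[Long], state: Option[Session]): Option[Session] = {
      val prev = state.getOrElse(Session(0, 0L))
      if (events.isEmpty) Some(prev) // still called even when the key saw no new data
      else Some(Session(prev.count + events.size, events.max))
    }

    // events: DStream[(String, Long)] of (sessionKey, eventTime) pairs
    // val sessions = events.updateStateByKey(updateSession _)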

Re: Spark and HDFS

2015-07-15 Thread ayan guha
Assuming you run Spark locally (i.e. either local mode or a standalone cluster on your local machine): 1. You need to have the Hadoop binaries locally. 2. You need to have hdfs-site.xml on the Spark classpath of your local machine. I would suggest you start off with local files to play around. If you need to run spark
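A minimal sketch of that local-file starting point (file paths and the namenode address below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-test").setMaster("local[*]"))

    // Start with a plain local file...
    val local = sc.textFile("file:///tmp/sample.txt")
    println(local.count())

    // ...then switch to an HDFS path once hdfs-site.xml is on the classpath.
    // val remote = sc.textFile("hdfs://namenode:8020/data/sample.txt")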

Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, I have been playing with the Spark R API that was introduced in Spark 1.4. Can we use any MLlib functionality from R as of now? From the documentation it looks like we can only use SQL/DataFrame functionality. I know there is a separate SparkR project, but it does not

Re: [SparkR] creating dataframe from json file

2015-07-15 Thread jianshu Weng
Thanks. t <- getField(df$hashtags, "text") does return a Column. But when I tried to call t <- getField(df$hashtags, "text"), it would give an error: Error: All select() inputs must resolve to integer column positions. The following do not: * getField(df$hashtags, "text") In fact, the text field in df

Re: Research ideas using spark

2015-07-15 Thread William Temperley
There seems to be a bit of confusion here - the OP (doing the PhD) had the thread hijacked by someone with a similar name asking a mundane question. It would be a shame to so rudely send away someone who may do valuable work on Spark. Sashidar (not Sashid!), I'm personally interested in running

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread Burak Yavuz
Hi, There is no MLlib support in SparkR in 1.4. There will be some support in 1.5. You can check these JIRAs for progress: https://issues.apache.org/jira/browse/SPARK-6805 https://issues.apache.org/jira/browse/SPARK-6823 Best, Burak On Wed, Jul 15, 2015 at 6:00 AM, madhu phatak

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
An in-memory hash key data structure of some kind so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys that never get hit, and for unusually long sessions you have to either
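A rough sketch of the alternative being described: aggregate each batch with a hash map so work is proportional to the items in the batch, and expire idle keys separately (the Session type, in-memory store, and TTL policy here are assumptions for illustration, not a prescribed design):

    import scala.collection.mutable

    case class Session(var count: Int, var lastSeen: Long)

    // Hypothetical session store; in practice this could live in an external
    // key-value store rather than in memory on one JVM.
    val sessions = mutable.HashMap[String, Session]()

    // Apply one micro-batch of (sessionKey, eventTime) pairs.
    def updateBatch(batch: Seq[(String, Long)], now: Long, ttlMs: Long): Unit = {
      // Only keys present in this batch are touched.
      batch.foreach { case (key, ts) =>
        val s = sessions.getOrElseUpdate(key, Session(0, ts))
        s.count += 1
        s.lastSeen = math.max(s.lastSeen, ts)
      }
      // Expiration for keys that never get hit - the extra complexity mentioned above.
      sessions.retain { case (_, s) => now - s.lastSeen < ttlMs }
    }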

Tasks unevenly distributed in Spark 1.4.0

2015-07-15 Thread gisleyt
Hello all, I upgraded from Spark 1.3.1 to 1.4.0, but I'm experiencing a massive drop in performance for the application I'm running. I've (somewhat) reproduced this behaviour in the attached file. My current Spark setup may not be exactly optimal for this reproduction, but I see that Spark

Strange Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-07-15 Thread Saeed Shahrivari
I use a simple map/reduce step in a Java/Spark program to remove duplicated documents from a large (10 TB compressed) sequence file containing HTML pages. Here is the partial code: JavaPairRDD<BytesWritable, NullWritable> inputRecords = sc.sequenceFile(args[0], BytesWritable.class,
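For reference, a minimal Scala sketch of the dedup pattern being discussed: key each page by a content hash, then keep one record per key (the choice of SHA-1 is an assumption for illustration):

    import java.security.MessageDigest

    def sha1(bytes: Array[Byte]): String =
      MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString

    // pages: RDD[Array[Byte]] of raw page contents (loading elided).
    // Key by content hash and keep the first record seen for each hash;
    // reduceByKey combines duplicates map-side before the shuffle.
    // val unique = pages.map(p => (sha1(p), p)).reduceByKey((a, _) => a).values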

java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-15 Thread Wang, Ningjun (LNG-NPV)
I just installed Spark 1.3.1 on Windows 2008 Server. When I start spark-shell, I get the following error: Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 Please advise. Thanks. Ningjun

DataFrame more efficient than RDD?

2015-07-15 Thread k0ala
Hi, I have been working a bit with RDDs, and am now taking a look at DataFrames. The schema definition using case classes looks very attractive: https://spark.apache.org/docs/1.4.0/sql-programming-guide.html#inferring-the-schema-using-reflection
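A minimal sketch of the reflection-based approach from that guide, using the 1.4-era API (the Person fields are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sc = new SparkContext(new SparkConf().setAppName("df-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The schema is inferred from the case class fields via reflection.
    val df = sc.parallelize(Seq(Person("Ann", 32), Person("Bob", 41))).toDF()
    df.filter($"age" > 35).show()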

Re: Spark and HDFS

2015-07-15 Thread Naveen Madhire
Yes, I did this recently. You need to copy the Cloudera cluster's conf files to the local machine and set HADOOP_CONF_DIR or YARN_CONF_DIR. The local machine should also be able to ssh to the Cloudera cluster. On Wed, Jul 15, 2015 at 8:51 AM, ayan guha guha.a...@gmail.com wrote:
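For example (the paths are placeholders; point them at wherever the copied cluster configuration files live):

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export YARN_CONF_DIR=/etc/hadoop/conf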

Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread lokeshkumar
Hi forum, I have downloaded the latest Spark version, 1.4.0, and started using it. But I couldn't find the compute-classpath.sh file in bin/ which I was using in previous versions to provide third-party libraries to my application. Can anyone please let me know where I can provide CLASSPATH with my
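One common way to do this in 1.4, assuming the extra jars can be passed at submit time (paths and the class name below are placeholders), is via spark-submit options rather than compute-classpath.sh:

    spark-submit \
      --jars /opt/libs/dep1.jar,/opt/libs/dep2.jar \
      --driver-class-path /opt/libs/dep1.jar:/opt/libs/dep2.jar \
      --class com.example.MyApp myapp.jar

The equivalent settings can also go in conf/spark-defaults.conf as spark.driver.extraClassPath and spark.executor.extraClassPath.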
