Re: RDD generated on every query

2015-04-14 Thread Akhil Das
You can use a Tachyon-based storage for that, and every time the client queries, you just get it from there. Thanks Best Regards On Mon, Apr 6, 2015 at 6:01 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi , In Spark Web Application the RDD is generated every time the client is

Re: set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-14 Thread Akhil Das
You could try leaving all the configuration values at their defaults and running your application to see if you still hit the heap issue. If so, try adding swap space to the machines, which will definitely help. Another way would be to set the heap space manually (export _JAVA_OPTIONS=-Xmx5g)

Re: Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-14 Thread Akhil Das
When you say done fetching documents, do you mean that you are stopping the StreamingContext (ssc.stop), or that you completed fetching documents for a batch? If possible, could you paste your custom receiver code so that we can have a look at it? Thanks Best Regards On Tue, Apr 7, 2015 at

Re: start-slave.sh not starting

2015-04-14 Thread Akhil Das
Why are you not using sbin/start-all.sh? Thanks Best Regards On Wed, Apr 8, 2015 at 10:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to start the worker by: sbin/start-slave.sh spark://ip-10-241-251-232:7077 In the logs it's complaining about: Master must be a URL of the

Re: RDD generated on every query

2015-04-14 Thread twinkle sachdeva
Hi, If you have the same Spark context, then you can cache the query result by caching the table (sqlContext.cacheTable(tableName)). Maybe you can also have a look at the Ooyala job server. On Tue, Apr 14, 2015 at 11:36 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can use a tachyon based
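A minimal sketch of the caching approach twinkle describes, assuming one long-lived SQLContext shared across client queries (table name and path are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("query-cache"))
  val sqlContext = new SQLContext(sc)

  // Register the data once and cache the table; later client queries hit the in-memory columnar cache.
  val docs = sqlContext.parquetFile("hdfs:///data/docs.parquet")   // illustrative path
  docs.registerTempTable("docs")
  sqlContext.cacheTable("docs")

  // Subsequent queries against "docs" reuse the cached data instead of recomputing the RDD.
  val hits = sqlContext.sql("SELECT * FROM docs WHERE score > 0.5")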

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-04-14 Thread Akhil Das
One hack you can put in would be to bring the Result class http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hbase/hbase/0.89.20100924-28/org/apache/hadoop/hbase/client/Result.java/?v=source locally, make it serializable (implements Serializable) and use it.

Re: value reduceByKeyAndWindow is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2015-04-14 Thread Akhil Das
Just make sure you import the following: import org.apache.spark.SparkContext._ import org.apache.spark.streaming.StreamingContext._ Thanks Best Regards On Wed, Apr 8, 2015 at 6:38 AM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I am trying to implement this example (Spark Streaming with
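A minimal windowed word-count sketch showing where those imports and reduceByKeyAndWindow fit (source and durations are illustrative; the inverse-reduce variant also needs a checkpoint directory):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  val conf = new SparkConf().setAppName("window-counts")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("/tmp/checkpoint")               // required by the inverse-reduce form below

  val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
  val counts = words.map(w => (w, 1))
    .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))

  counts.print()
  ssc.start()
  ssc.awaitTermination()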

Re: Expected behavior for DataFrame.unionAll

2015-04-14 Thread Reynold Xin
I think what happened was applying the narrowest possible type. Type widening is required, and as a result, the narrowest common type between a string and an int is string.
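A small illustration of that widening, assuming an existing SparkContext sc and the Spark 1.3 DataFrame API (data is illustrative):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val ints    = sc.parallelize(Seq(1, 2, 3)).toDF("value")     // value: int
  val strings = sc.parallelize(Seq("a", "b")).toDF("value")    // value: string

  // The unioned column is widened to the narrowest common type, i.e. string.
  ints.unionAll(strings).printSchema()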

Re: set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-14 Thread twinkle sachdeva
Hi, In one of the applications we have made, which had nothing cached, we set spark.storage.memoryFraction to a very low value, and yes, that gave us performance benefits. Regarding that issue, you should also look at the data you are trying to broadcast, as sometimes creating that data

Re: Cannot change the memory of workers

2015-04-14 Thread Akhil Das
If you want to use 2g of memory on each worker, you can simply export SPARK_WORKER_MEMORY=2g inside your spark-env.sh on all machines in the cluster. Thanks Best Regards On Wed, Apr 8, 2015 at 7:27 AM, Jia Yu jia...@asu.edu wrote: Hi guys, Currently I am running Spark program on Amazon EC2.

Re: Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] when create context

2015-04-14 Thread Akhil Das
Can you share a bit more information on the type of application that you are running? From the stacktrace I can only say that, for some reason, your connection timed out (probably a GC pause or network issue). Thanks Best Regards On Wed, Apr 8, 2015 at 9:48 PM, Shuai Zheng szheng.c...@gmail.com wrote:

Running Spark on Gateway - Connecting to Resource Manager Retries

2015-04-14 Thread Vineet Mishra
Hi Team, I am running the Spark word count example (https://github.com/sryza/simplesparkapp). If I go with master as local it works fine, but when I change the master to yarn it ends with retries connecting to the resource manager (stack trace mentioned below), 15/04/14 11:31:57 INFO RMProxy: Connecting

Re: [Spark1.3] UDF registration issue

2015-04-14 Thread Reynold Xin
You can do this: val strLen = udf((s: String) => s.length()) cleanProcessDF.withColumn("dii", strLen(col("di"))) (You might need to play with the type signature a little bit to get it to compile) On Fri, Apr 10, 2015 at 11:30 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, I'm running into some
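Spelled out a bit more, under the assumption that cleanProcessDF is the questioner's DataFrame with a string column named di:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, udf}

  val cleanProcessDF: DataFrame = ???                    // the questioner's DataFrame
  val strLen = udf((s: String) => s.length)
  val withLen = cleanProcessDF.withColumn("dii", strLen(col("di")))   // new column holding the length of "di"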

Re: SparkSQL + Parquet performance

2015-04-14 Thread Akhil Das
That totally depends on your disk IO and the number of CPUs that you have in the cluster. For example, if you have a disk IO of 100MB/s and a handful of CPUs (say 40 cores on 10 machines), then it could take you to ~1GB/s, I believe. Thanks Best Regards On Tue, Apr 7, 2015 at 2:48 AM,

Re: Expected behavior for DataFrame.unionAll

2015-04-14 Thread Justin Yip
That explains it. Thanks Reynold. Justin On Mon, Apr 13, 2015 at 11:26 PM, Reynold Xin r...@databricks.com wrote: I think what happened was applying the narrowest possible type. Type widening is required, and as a result, the narrowest common type between a string and an int is string.

Re: streamSQL - is it available or is it in POC ?

2015-04-14 Thread Akhil Das
We have a similar version (Sigstream); you can find more over here https://sigmoid.com/ Thanks Best Regards On Wed, Apr 8, 2015 at 9:25 AM, haopu hw...@qilinsoft.com wrote: I'm also interested in this project. Do you have any update on it? Is it still active? Thank you! -- View this

Re: save as text file throwing null pointer error.

2015-04-14 Thread Akhil Das
Where exactly is it throwing the null pointer exception? Are you starting your program from another program or something? It looks like you are invoking ProcessBuilder etc. Thanks Best Regards On Thu, Apr 9, 2015 at 6:46 PM, Somnath Pandeya somnath_pand...@infosys.com wrote: JavaRDDString

Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread pishen
Hello, I tried to follow the tutorial of Spark SQL, but am not able to saveAsParquetFile from an RDD of case class. Here is my Main.scala and build.sbt https://gist.github.com/pishen/939cad3da612ec03249f At line 34, the compiler said that value saveAsParquetFile is not a member of

Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread pishen tsai
OK, it does work. Maybe it would be better to update this usage in the official Spark SQL tutorial: http://spark.apache.org/docs/latest/sql-programming-guide.html Thanks, pishen 2015-04-14 15:30 GMT+08:00 fightf...@163.com fightf...@163.com: Hi,there If you want to use the saveAsParquetFile,

Spark: Using node-local files within functions?

2015-04-14 Thread Horsmann, Tobias
Hi, I am trying to use Spark in combination with Yarn with 3rd party code which is unaware of distributed file systems. Providing hdfs file references thus does not work. My idea to resolve this issue was the following: Within a function I take the HDFS file reference I get as parameter and

Spark SQL reading parquet decimal

2015-04-14 Thread Clint McNeil
Hi guys, I have parquet data written by Impala: Server version: impalad version 2.1.2-cdh5 RELEASE (build 36aad29cee85794ecc5225093c30b1e06ffb68d3) When using Spark SQL 1.3.0 (spark-assembly-1.3.0-hadoop2.4.0) I get the following error: val correlatedEventData = sqlCtx.sql( s

Re: Spark Streaming not picking current date properly

2015-04-14 Thread Akhil Das
You can try something like this: eventsDStream.foreachRDD(rdd => { val curdate = new DateTime() val fmt = DateTimeFormat.forPattern("dd_MM_"); rdd.saveAsTextFile("s3n://bucket_name/test/events_"+fmt.print(curdate)+"/events") }) Thanks Best Regards On Fri, Apr 10, 2015 at 4:22
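The same idea with the Joda-Time imports it relies on, assuming eventsDStream is the questioner's DStream (bucket and date pattern are illustrative):

  import org.joda.time.DateTime
  import org.joda.time.format.DateTimeFormat

  eventsDStream.foreachRDD { rdd =>
    val fmt = DateTimeFormat.forPattern("dd_MM_yyyy")          // illustrative pattern
    val today = fmt.print(new DateTime())                      // evaluated per batch, so it tracks the current date
    rdd.saveAsTextFile(s"s3n://bucket_name/test/events_$today/events")
  }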

Spark Data Formats ?

2015-04-14 Thread ๏̯͡๏
Can you please share the native support of data formats available with Spark? Two I can see are parquet and textFile: sc.parquetFile, sc.textFile. I see that Hadoop input formats (Avro) are having issues, which I faced in earlier threads and which seem to be well known.

Re: Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] when create context

2015-04-14 Thread Sean Owen
This usually means something didn't start due to a fairly low-level error, like a class not found or incompatible Spark versions somewhere. At least, that's also what I see in unit tests when things like that go wrong. On Tue, Apr 14, 2015 at 8:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

Re: Spark Data Formats ?

2015-04-14 Thread Akhil Das
There's sc.objectFile also. Thanks Best Regards On Tue, Apr 14, 2015 at 2:59 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Can you please share the native support of data formats available with Spark. Two i can see are parquet and textFile sc.parquetFile sc.textFile I see that Hadoop

Clustering users according to their shopping traits

2015-04-14 Thread Zork Sail
Sorry for the off-topic question, I have not found a specific MLlib forum. Please advise a good overview of using clustering algorithms to group users according to purchase and browsing history on a web site. Thanks!

Spark SQL reading parquet decimal

2015-04-14 Thread Sparkle
Hi guys, I have parquet data written by Impala: Server version: impalad version 2.1.2-cdh5 RELEASE (build 36aad29cee85794ecc5225093c30b1e06ffb68d3) When using Spark SQL 1.3.0 (spark-assembly-1.3.0-hadoop2.4.0) I get the following error: val correlatedEventData = sqlCtx.sql( s

Spark Yarn-client Kerberos on remote cluster

2015-04-14 Thread philippe L
Dear All, I would like to know if it's possible to configure SparkConf() in order to interact with a remote kerberized cluster in yarn-client mode. Spark will not be installed on the cluster itself and the localhost can't ask for a ticket, but a keytab has been generated for this purpose and

Re: Increase partitions reading Parquet File

2015-04-14 Thread Masf
Hi. It doesn't work. val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet") file.repartition(127) println(file.partitions.size.toString()) -- Returns 27! Regards On Fri, Apr 10, 2015 at 4:50 PM, Felix C felixcheun...@hotmail.com wrote: RDD.repartition(1000)?

Spark 1.2, trying to run spark-history as a service, spark-defaults.conf are ignored

2015-04-14 Thread Serega Sheypak
Here is a related problem: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html but no answer. What I'm trying to do: wrap spark-history with an /etc/init.d script. Problems I have: I can't make it read spark-defaults.conf. I've put this file here:

Re: Spark Yarn-client Kerberos on remote cluster

2015-04-14 Thread Neal Yin
If your localhost can't talk to a KDC, you can't access a kerberized cluster. A keytab file alone is not enough. -Neal On 4/14/15, 3:54 AM, philippe L lanckvrind.p@gmail.com wrote: Dear All, I would like to know if it's possible to configure the SparkConf() in order to interact with a remote

Re: Spark SQL 1.3.1 saveAsParquetFile will output tachyon file with different block size

2015-04-14 Thread Cheng Lian
Would you mind opening a JIRA for this? I think your suspicion makes sense. Will have a look at this tomorrow. Thanks for reporting! Cheng On 4/13/15 7:13 PM, zhangxiongfei wrote: Hi experts I ran the below code in the Spark shell to access parquet files in Tachyon. 1.First, created a DataFrame by

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Marius Soutier
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. From the source code comments: // SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the // application finishes. On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote: Does

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Marius Soutier
That’s true, spill dirs don’t get cleaned up when something goes wrong. We are restarting long-running jobs once in a while for cleanups and have spark.cleaner.ttl set to a lower value than the default. On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote: Right, I

Converting Date pattern in scala code

2015-04-14 Thread BASAK, ANANDA
I need some help to convert the date pattern in my Scala code for Spark 1.3. I am reading dates from two flat files having two different date formats. File 1: 2015-03-27 File 2: 02-OCT-12 09-MAR-13 The format of file 2 is not being recognized by my Spark SQL when I compare it in a

Re: Running Spark on Gateway - Connecting to Resource Manager Retries

2015-04-14 Thread Neal Yin
Your YARN access is not configured. 0.0.0.0:8032 is the default YARN address. I guess you don't have yarn-site.xml on your classpath. -Neal From: Vineet Mishra clearmido...@gmail.com Date: Tuesday, April 14, 2015 at 12:05 AM To:

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Richard Marscher
Hi, I've gotten an application working with sbt-assembly and Spark, so I thought I'd present an option. In my experience, trying to bundle any of the Spark libraries in your uber jar is going to be a major pain. There will be a lot of deduplication to work through, and even if you resolve them it can

Re: org.apache.spark.ml.recommendation.ALS

2015-04-14 Thread Xiangrui Meng
Yes, I think the default Spark builds are on Scala 2.10. You need to follow instructions at http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 to build 2.11 packages. -Xiangrui On Mon, Apr 13, 2015 at 4:00 PM, Jay Katukuri jkatuk...@apple.com wrote: Hi Xiangrui,

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
Hi Vadim, After removing provided from org.apache.spark %% spark-streaming-kinesis-asl I ended up with a huge number of deduplicate errors: https://gist.github.com/trienism/3d6f8d6b7ff5b7cead6a It would be nice if you could share some pieces of your mergeStrategy code for reference. Also, after

Re: Regarding benefits of using more than one cpu for a task in spark

2015-04-14 Thread Imran Rashid
Hi twinkle, To be completely honest, I'm not sure; I had never heard of spark.task.cpus before. But I could imagine two different use cases: a) instead of just relying on Spark's creation of tasks for parallelism, a user wants to run multiple threads *within* a task. This is sort of going against

Re: Spark Data Formats ?

2015-04-14 Thread Michael Armbrust
Spark SQL (which also can give you an RDD for use with the standard Spark RDD API) has support for json, parquet, and hive tables http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources. There is also a library for Avro https://github.com/databricks/spark-avro. On Tue, Apr 14,

Re: Spark SQL reading parquet decimal

2015-04-14 Thread Michael Armbrust
Can you open a JIRA? On Tue, Apr 14, 2015 at 1:56 AM, Clint McNeil cl...@impactradius.com wrote: Hi guys I have parquet data written by Impala: Server version: impalad version 2.1.2-cdh5 RELEASE (build 36aad29cee85794ecc5225093c30b1e06ffb68d3) When using Spark SQL 1.3.0

Re: How to access postgresql on Spark SQL

2015-04-14 Thread Michael Armbrust
There is an example here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Mon, Apr 13, 2015 at 6:07 PM, doovs...@sina.com wrote: Hi all, Who know how to access postgresql on Spark SQL? Do I need add the postgresql dependency in build.sbt and set
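A minimal sketch of what that looks like with the Spark 1.3 API, assuming an existing sqlContext (URL, credentials and table are illustrative; the PostgreSQL JDBC driver jar must be on the driver and executor classpaths):

  val jdbcDF = sqlContext.load("jdbc", Map(
    "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
    "dbtable" -> "public.my_table"
  ))
  jdbcDF.printSchema()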

Re: Increase partitions reading Parquet File

2015-04-14 Thread Michael Armbrust
RDDs are immutable. Running .repartition does not change the RDD, but instead returns *a new RDD* with more partitions. On Tue, Apr 14, 2015 at 3:59 AM, Masf masfwo...@gmail.com wrote: Hi. It doesn't work. val file = SqlContext.parquetfile(hdfs://node1/user/hive/warehouse/ file.parquet)
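In other words, keep the result of repartition. A minimal sketch, assuming the Spark 1.3 DataFrame API (path illustrative):

  val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
  val repartitioned = file.repartition(127)
  println(repartitioned.rdd.partitions.size)   // now 127; printing file's partitions would still show the original count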

TaskResultLost

2015-04-14 Thread Pat Ferrel
Running on Spark 1.1.1, Hadoop 2.4 with YARN, AWS dedicated cluster (non-EMR). Is this in our code or config? I’ve never run into a TaskResultLost, not sure what can cause that. TaskResultLost (result lost from block manager) nivea.m https://gd-a.slack.com/team/nivea.m[11:01 AM] collect at

RE: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
> Also, it can be a problem when reusing the same SparkContext for many runs. That is what happened to me. We use spark-jobserver and use one SparkContext for all jobs. The SPARK_LOCAL_DIRS is not cleaned up and is eating disk space quickly. Ningjun From: Marius Soutier

spark streaming printing no output

2015-04-14 Thread Shushant Arora
Hi, I am running a spark streaming application but nothing is getting printed on the console. I am doing: 1. bin/spark-shell --master clusterMgrUrl 2. import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.dstream.DStream
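For reference, a minimal skeleton that should print per-batch output from the shell (assumes the shell's existing SparkContext sc; source is illustrative). Two common reasons nothing appears are a missing ssc.start() or running with a single core, so the receiver starves the batch processing — use at least two cores.

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(2))

  val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source
  lines.count().print()                                  // one line of output per batch interval

  ssc.start()
  ssc.awaitTermination()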

spark-assembly-1.3.0-hadoop2.3.0.jar has unsigned entries - org/apache/spark/SparkHadoopWriter$.class

2015-04-14 Thread Manoj Samel
With Spark 1.3, xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2. Config is CDH 5.3.0 (Hadoop 2.3) with Kerberos. 15/04/14 18:06:15 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 2.0 (TID 17) on executor node1078.svc.devpg.pdx.wd: java.lang.SecurityException

Re: How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
I forgot to mention that the imageId field is a custom Scala object. Do I need to implement some special methods to make it work (equals, hashCode)? On Tue, Apr 14, 2015 at 5:00 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, In the latest version of spark there's a feature called :

Re: Spark: Using node-local files within functions?

2015-04-14 Thread Sandy Ryza
Hi Tobias, It should be possible to get an InputStream from an HDFS file. However, if your libraries only work directly on files, then maybe that wouldn't work? If that's the case and different tasks need different files, your way is probably the best way. If all tasks need the same file, a

How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
Dear all, In the latest version of Spark there's a feature called automatic partition discovery and schema migration for Parquet. As far as I know, this gives the ability to split the DataFrame into several parquet files, and by just loading the parent directory one can get the global schema of

Saving RDDs as custom output format

2015-04-14 Thread Daniel Haviv
Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

How to join RDD keyValuePairs efficiently

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that contains millions of Document objects. Each document has a unique id that is a string. I need to find documents by id quickly. Currently I use an RDD join as follows. First I save the RDD as an object file: allDocs : RDD[Document] = getDocs() // this RDD contains 7 million
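One common pattern for this, sketched under the assumption that getDocs() and the id list come from the poster's code and that a SparkContext sc exists (names are illustrative): key and hash-partition the big RDD once, cache it, then join against the ids.

  import org.apache.spark.HashPartitioner
  import org.apache.spark.rdd.RDD

  case class Document(id: String, text: String)
  def getDocs(): RDD[Document] = ???                      // the 7-million-document RDD from the post
  val ids: Seq[String] = ???                              // the ids to look up

  val byId = getDocs().keyBy(_.id)
    .partitionBy(new HashPartitioner(200))
    .cache()                                              // pay the shuffle once, reuse for every lookup

  val wanted = sc.parallelize(ids).map(id => (id, ()))
  val found  = byId.join(wanted).values.map(_._1)         // the documents whose id is in `ids`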

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Guillaume Pitel
Right, I remember now, the only problematic case is when things go bad and the cleaner is not executed. Also, it can be a problem when reusing the same sparkcontext for many runs. Guillaume It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. From the source code

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Tathagata Das
Fundamentally, stream processing systems are designed for processing streams of data, not for storing large volumes of data for a long period of time. So if you have to maintain that much state for months, then its best to use another system that is designed for long term storage (like Cassandra)

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Vadim Bichutskiy
Thanks guys. This might explain why I might be having problems. Vadim On Tue, Apr 14, 2015 at 5:27 PM, Mike Trienis mike.trie...@orcsol.com wrote: Richard, Your response was very helpful and actually resolved my issue. In case others run into a similar issue, I followed the procedure:

Re: Registering classes with KryoSerializer

2015-04-14 Thread Imran Rashid
hmm, I dunno why IntelliJ is unhappy, but you can always fall back to getting the class from the String: Class.forName("scala.reflect.ClassTag$$anon$1") perhaps the class is package private or something, and the repl somehow subverts it ... On Tue, Apr 14, 2015 at 5:44 PM, Arun Lists

Re: Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-14 Thread Tathagata Das
What version of Spark are you using? There was a known bug which could be causing this. It got fixed in Spark 1.3 TD On Mon, Apr 13, 2015 at 11:44 PM, Akhil Das ak...@sigmoidanalytics.com wrote: When you say done fetching documents, does it mean that you are stopping the streamingContext?

RE: save as text file throwing null pointer error.

2015-04-14 Thread Somnath Pandeya
Hi Akhil, I am running my program standalone. I am getting a null pointer exception when I run my Spark program locally and try to save my RDD as a text file. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, April 14, 2015 12:41 PM To: Somnath Pandeya Cc:

Re: Need some guidance

2015-04-14 Thread Victor Tso-Guillen
Thanks, yes. I was using Int for my V and didn't get the second param in the second closure right :) On Mon, Apr 13, 2015 at 1:55 PM, Dean Wampler deanwamp...@gmail.com wrote: That appears to work, with a few changes to get the types correct: input.distinct().combineByKey((s: String) => 1,

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Tathagata Das
Have you tried marking only spark-streaming-kinesis-asl as not provided, and the rest as provided? Then you will not even need to add kinesis-asl.jar in the spark-submit. TD On Tue, Apr 14, 2015 at 2:27 PM, Mike Trienis mike.trie...@orcsol.com wrote: Richard, You response was very helpful

Re: spark streaming printing no output

2015-04-14 Thread Shixiong Zhu
Could you see something like this in the console? --- Time: 142905487 ms --- Best Regards, Shixiong(Ryan) Zhu 2015-04-15 2:11 GMT+08:00 Shushant Arora shushantaror...@gmail.com: Hi I am running a spark

Re: Registering classes with KryoSerializer

2015-04-14 Thread Arun Lists
Wow, it all works now! Thanks, Imran! In case someone else finds this useful, here are the additional classes that I had to register (in addition to my application specific classes): val tuple3ArrayClass = classOf[Array[Tuple3[Any, Any, Any]]] val anonClass =
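For reference, a sketch of how such registrations typically go into the SparkConf, combining the classOf form above with Imran's Class.forName workaround (the class list is illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")

  conf.registerKryoClasses(Array(
    classOf[Array[Tuple3[Any, Any, Any]]],
    Class.forName("scala.reflect.ClassTag$$anon$1")
  ))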

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Krzysztof Zarzycki
Thank you Tathagata, very helpful answer. Though, I would like to highlight that recent stream processing systems are trying to help users in implementing use case of holding such large (like 2 months of data) states. I would mention here Samza state management

SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Hi guys, Trying to use a Spark SQL context’s .load("jdbc", …) method to create a DF from a JDBC data source. All seems to work well locally (master = local[*]), however as soon as we try and run on YARN we have problems. We seem to be running into problems with the class path and loading up the

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Just an update, tried with the old JdbcRDD and that worked fine. From: Nathan nathan.mccar...@quantium.com.au Date: Wednesday, 15 April 2015 1:57 pm To: user@spark.apache.org

Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread pishen tsai
I've changed it to import sqlContext.implicits._ but it still doesn't work. (I've updated the gist) BTW, using .toDF() does work, thanks for this information. Regards, pishen 2015-04-14 20:35 GMT+08:00 Todd Nist tsind...@gmail.com: I think the docs are correct. If you follow the example from the

Re: counters in spark

2015-04-14 Thread Imran Rashid
Hi Robert, A lot of task metrics are already available for individual tasks. You can get these programmatically by registering a SparkListener, and you can also view them in the UI. E.g., for each task, you can see runtime, serialization time, amount of shuffle data read, etc. I'm working on

Re: spark ml model info

2015-04-14 Thread Xiangrui Meng
If you are using Scala/Java or pyspark.mllib.classification.LogisticRegressionModel, you should be able to call weights and intercept to get the model coefficients. If you are using the pipeline API in Python, you can try model._java_model.weights(); we are going to add a method to get the weights
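A minimal Scala sketch of reading the coefficients off the mllib model (training data is illustrative; assumes an existing SparkContext sc):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  val training = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(1.0, 0.1)),
    LabeledPoint(1.0, Vectors.dense(0.2, 1.5))
  ))
  val model = new LogisticRegressionWithLBFGS().run(training)

  println(model.weights)    // per-feature coefficients
  println(model.intercept)  // intercept term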

Re: Help understanding the FP-Growth algrithm

2015-04-14 Thread Xiangrui Meng
If you want to see an example that calls MLlib's FPGrowth, you can find them under the examples/ folder: Scala: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/FPGrowthExample.scala, Java:

Re: Registering classes with KryoSerializer

2015-04-14 Thread Imran Rashid
Hi Arun, It can be hard to use kryo with required registration because of issues like this -- there isn't a good way to register all the classes that you need transitively. In this case, it looks like one of your classes has a reference to a ClassTag, which in turn has a reference to some

Re: [BUG]Broadcast value return empty after turn to org.apache.spark.serializer.KryoSerializer

2015-04-14 Thread Imran Rashid
Hi Shuai, I don't think this is a bug with Kryo, it's just a subtlety in how Kryo works. I *think* that it would also work if you changed your PropertiesUtil class to either (a) remove the no-arg constructor or (b) instead of extending Properties, make it a contained member variable. I wish

Catching executor exception from executor in driver

2015-04-14 Thread Justin Yip
Hello, I would like to know if there is a way of catching, in the driver program, an exception thrown from an executor. Here is an example: try { val x = sc.parallelize(Seq(1,2,3)).map(e => e / 0).collect } catch { case e: SparkException => { println(s"ERROR: $e") println(s"CAUSE:

Re: Converting Date pattern in scala code

2015-04-14 Thread Denny Lee
If you're doing it in Scala per se, then you can probably just reference Joda-Time or the Java Date/Time classes. If you are using SparkSQL, then you can use the various Hive date functions for conversion. On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA ab9...@att.com wrote: I need some help to convert
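A small plain-Scala sketch of normalizing the second file's dates to the yyyy-MM-dd form used by the first file (using the JDK date classes; Joda-Time would look similar):

  import java.text.SimpleDateFormat
  import java.util.Locale

  val in  = new SimpleDateFormat("dd-MMM-yy", Locale.ENGLISH)
  val out = new SimpleDateFormat("yyyy-MM-dd")

  val normalized = out.format(in.parse("02-OCT-12"))   // "2012-10-02"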

Help understanding the FP-Growth algrithm

2015-04-14 Thread Eric Tanner
I am a total newbie to Spark, so be kind. I am looking for an example that implements the FP-Growth algorithm so I can better understand both the algorithm and Spark. The one example I found (in the spark.apache.org examples) was incomplete. Thanks, Eric

Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Krzysztof Zarzycki
Hey guys, could you please help me with a question I asked on Stackoverflow: https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two ? I'll be really grateful for your help! I'm also pasting the question below: I'm trying to

Re: Array[T].distinct doesn't work inside RDD

2015-04-14 Thread Imran Rashid
Interesting, my gut instinct is the same as Sean's. I'd suggest debugging this in plain old Scala first, without involving Spark. Even just in the Scala shell, create one of your Array[T], try calling .toSet and calling .distinct. If those aren't the same, then it's got nothing to do with Spark.

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-14 Thread Imran Rashid
Shuffle write could be a good indication of skew, but it looks like the task in question hasn't generated any shuffle write yet, because it's still working on the shuffle-read side. So I wouldn't read too much into the fact that the shuffle write is 0 for a task that is still running. The

spark ml model info

2015-04-14 Thread Jianguo Li
Hi, I am training a model using the logistic regression algorithm in ML. I was wondering if there is any API to access the weight vector (i.e. the coefficients for each feature). I need those coefficients for real-time predictions. Thanks, Jianguo

Re: Registering classes with KryoSerializer

2015-04-14 Thread Arun Lists
Hi Imran, Thanks for the response! However, I am still not there yet. In the Scala interpreter, I can do: scala> classOf[scala.reflect.ClassTag$$anon$1] but when I try to do this in my program in IntelliJ, it indicates an error: Cannot resolve symbol ClassTag$$anon$1 Hence I am not any closer

Re: Catching executor exception from executor in driver

2015-04-14 Thread Imran Rashid
(+dev) Hi Justin, short answer: no, there is no way to do that. I'm just guessing here, but I imagine this was done to eliminate serialization problems (eg., what if we got an error trying to serialize the user exception to send from the executors back to the driver?). Though, actually that

Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread Todd Nist
I think the docs are correct. If you follow the example from the docs and add the import shown below, I believe you will get what you're looking for: // This is used to implicitly convert an RDD to a DataFrame. import sqlContext.implicits._ You could also simply take your rdd and do the following:
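Put together, a minimal sketch of both steps (case class and path are illustrative; assumes an existing SparkContext sc):

  import org.apache.spark.sql.SQLContext

  case class Record(id: Int, name: String)

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._                          // enables .toDF() on RDDs of case classes

  val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
  rdd.toDF().saveAsParquetFile("hdfs:///tmp/records.parquet")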

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
Richard, Your response was very helpful and actually resolved my issue. In case others run into a similar issue, I followed this procedure: - Upgraded to Spark 1.3.0 - Marked all Spark-related libraries as provided - Included Spark transitive library dependencies, where my build.sbt file
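An illustrative build.sbt fragment for that kind of setup (exact key names depend on the sbt-assembly version): Spark core/streaming marked provided since the cluster supplies them, the kinesis-asl artifact bundled into the assembly, and a catch-all merge strategy for the remaining duplicates.

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
  )

  // resolve duplicate files when building the uber jar
  assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case _                             => MergeStrategy.first
  }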

Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread Michael Armbrust
More info on why toDF is required: http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 On Tue, Apr 14, 2015 at 6:55 AM, pishen tsai pishe...@gmail.com wrote: I've changed it to import sqlContext.implicits._ but it still doesn't work. (I've