Re: Connection pooling in spark jobs

2015-04-03 Thread Charles Feduke
Out of curiosity I wanted to see what JBoss supported in terms of clustering and database connection pooling, since its implementation should suffice for your use case. I found: *Note:* JBoss does not recommend using this feature in a production environment. It requires accessing a connection pool

Re: Delaying failed task retries + giving failing tasks to different nodes

2015-04-03 Thread Akhil Das
I think these are the configurations you are looking for: *spark.locality.wait*: Number of milliseconds to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait will be used to step through multiple locality levels (process-local,
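If it helps, a minimal sketch of applying this setting (the values and app name are illustrative only, not recommendations):

```scala
import org.apache.spark.SparkConf

// Wait up to 10s for a data-local slot before stepping down a locality level.
val conf = new SparkConf()
  .setAppName("locality-tuning-demo")   // illustrative app name
  .set("spark.locality.wait", "10000")  // milliseconds
```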

Re: Mllib kmeans #iteration

2015-04-03 Thread amoners
Have you referred to the official documentation of kmeans at https://spark.apache.org/docs/1.1.1/mllib-clustering.html ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353p22365.html Sent from the Apache Spark User List mailing list archive

Re: Spark Application Stages and DAG

2015-04-03 Thread Tathagata Das
What he meant is to look it up in the Spark UI, specifically in the Stage tab, to see what is taking so long. And yes, a code snippet helps us debug. TD On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to open the Stage's page which is taking time, and see how

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
How did you build Spark? Which version of Spark are you using? Doesn't this thread already explain it? https://www.mail-archive.com/user@spark.apache.org/msg25505.html Thanks Best Regards On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote: Hi Akhil, Tried your suggestion

Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Akhil Das
Did you try these? - Disable shuffle spill: spark.shuffle.spill=false - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") Thanks Best Regards On Fri, Apr 3, 2015

Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Charles Feduke
You could also try setting your `nofile` value in /etc/security/limits.conf for `soft` to some ridiculously high value if you haven't done so already. On Fri, Apr 3, 2015 at 2:09 AM Akhil Das ak...@sigmoidanalytics.com wrote: Did you try these? - Disable shuffle : spark.shuffle.spill=false -
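An illustrative limits.conf entry along those lines (the user name and limit value are made up, not recommendations):

```
# /etc/security/limits.conf
sparkuser  soft  nofile  1000000
sparkuser  hard  nofile  1000000
```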

About Waiting batches on the spark streaming UI

2015-04-03 Thread bit1...@163.com
I copied the following from the Spark Streaming UI. I don't know why Waiting batches is 1; my understanding is that it should be 72. Following is my understanding: 1. Total time is 1 minute 35 seconds = 95 seconds. 2. Batch interval is 1 second, so 95 batches are generated in 95 seconds. 3.

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-03 Thread Jaonary Rabarisoa
Good! Thank you. On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng men...@gmail.com wrote: I reproduced the bug on master and submitted a patch for it: https://github.com/apache/spark/pull/5329. It may get into Spark 1.3.1. Thanks for reporting the bug! -Xiangrui On Wed, Apr 1, 2015 at 12:57

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
I placed it there. It was downloaded from the MySQL site. On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Akhil you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar . how come you got this lib into spark/lib folder. 1) did you place it there ? 2) What

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Hi Deepujain, I did include the jar file, I believe it is hive-exec.jar, through the --jars option: ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars

Re: maven compile error

2015-04-03 Thread Ted Yu
Can you include -X in your maven command and pastebin the output ? Cheers On Apr 3, 2015, at 3:58 AM, myelinji myeli...@aliyun.com wrote: Thank you for your reply. When I'm using maven to compile the whole project, the erros as follows [INFO] Spark Project Parent POM

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
I think you need to include, through the --jars option, the jar file that contains the Hive definition (code) of the UDF json_tuple. That should solve your problem. On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote: I placed it there. It was downloaded from the MySQL site. On Fri, Apr 3,

Which OS for Spark cluster nodes?

2015-04-03 Thread Horsmann, Tobias
Hi, Are there any recommendations for operating systems that one should use for setting up Spark/Hadoop nodes in general? I am not familiar with the differences between the various linux distributions or how well they are (not) suited for cluster set-ups, so I wondered if there is some

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Started the spark shell with the one jar from hive suggested: ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar Results in the same

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
Copy pasted his command in the same thread. Thanks Best Regards On Fri, Apr 3, 2015 at 3:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Akhil you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar . how come you got this lib into spark/lib folder. 1) did you place it there

Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Akhil Das
There isn't any specific Linux distro, but I would prefer Ubuntu for a beginner as it's very easy to apt-get install stuff on it. Thanks Best Regards On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias tobias.horsm...@uni-due.de wrote: Hi, Are there any recommendations for operating systems

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
This thread might give you some insights http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E Thanks Best Regards On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: My Spark Job

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
What version of Cassandra are you using? Are you using DSE or the stock Apache Cassandra version? I have connected it with DSE, but have not attempted it with the standard Apache Cassandra version. FWIW,

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
Hi Debasish, Charles, I solved the problem by using a BPQ-like (bounded priority queue) method, based on your suggestions. So thanks very much for that! My approach was: 1) Count the population of each segment in the RDD by map/reduce so that I get the bound number N equivalent to 10% of each segment. This becomes the
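A hedged sketch of the bounded-priority-queue idea described above (names, types, and the key type are illustrative assumptions): keep only the top n values per segment while aggregating, so no full group is ever materialized.

```scala
import scala.collection.mutable.PriorityQueue
import org.apache.spark.rdd.RDD

// With Ordering.reverse, dequeue() removes the smallest element,
// so each queue retains only the n largest values seen for its key.
def topNPerSegment(data: RDD[(String, Double)], n: Int): RDD[(String, Seq[Double])] =
  data.aggregateByKey(PriorityQueue.empty[Double](Ordering[Double].reverse))(
    (q, v)   => { q.enqueue(v); if (q.size > n) q.dequeue(); q },
    (q1, q2) => { q1 ++= q2; while (q1.size > n) q1.dequeue(); q1 }
  ).mapValues(_.dequeueAll.reverse.toSeq)   // descending order
```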

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Dean Wampler
A hack workaround is to use flatMap: rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) } For those of you who don't know Scala, the for comprehension iterates through the ArrayBuffer, named array, and yields new tuples with the date and each element. The case expression to the

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I was able to write a record that extends SpecificRecord (Avro); this class was not auto-generated. Do we need to do something extra for auto-generated classes? Sent from my iPhone On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote: This thread might give you some insights

Re: Spark 1.3 UDF ClassNotFoundException

2015-04-03 Thread Markus Ganter
My apologies. I was running this locally and the JAR I was building using IntelliJ had some issues. This was not related to UDFs. All works fine now. On Thu, Apr 2, 2015 at 2:58 PM, Ted Yu yuzhih...@gmail.com wrote: Can you show more code in CreateMasterData ? How do you run your code ?

Parquet timestamp support for Hive?

2015-04-03 Thread Rex Xiong
Hi, I got this error when creating a Hive table from a parquet file: DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384 I checked HIVE-6384; it's fixed in 0.14. The Hive in the Spark build is a customized

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I meant that I did not have to use Kryo. Why will Kryo help fix this issue now? Sent from my iPhone On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote: I was able to write a record that extends SpecificRecord (Avro); this class was not auto-generated. Do we need to do

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
Because it's throwing serialization exceptions, and Kryo is a serializer to serialize your objects. Thanks Best Regards On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain deepuj...@gmail.com wrote: I meant that I did not have to use Kryo. Why will Kryo help fix this issue now ? Sent from my

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Thanks Mohammed. Will give it a try today. We would also need the Spark SQL piece as we are migrating our data store from Oracle to C*, and it would be easier to maintain all the reports rather than recreating each one from scratch. Thanks, Pawan Venugopal. On Apr 3, 2015 7:59 AM, Mohammed Guller

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
Cool! You should also consider contributing it back to Spark if you are doing quantile calculations, for example... there is also a topByKey API added in master by @coderxiang; see if you can use that API to make the code clean. On Apr 3, 2015 5:20 AM, Aung Htet aung@gmail.com wrote: Hi

Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Charles Feduke
As Akhil says Ubuntu is a good choice if you're starting from near scratch. Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and other big data tools so you can get a cluster running with very little effort. Keep in mind Cloudera is a for-profit corporation so they are also

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd, Thanks for the link. I would be interested in this solution. I am using DSE for Cassandra. Would you provide me with info on connecting with DSE either through Tableau or Zeppelin? The goal here is to query Cassandra through Spark SQL so that I could perform joins and group by on my queries.

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Hi Todd, We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* using the ODBC driver, but now would like to add Spark SQL to the mix. I haven’t been able to find any documentation for how to make this combination work. We are using the Spark-Cassandra-Connector in our

Re: Cannot run the example in the Spark 1.3.0 following the document

2015-04-03 Thread Sean Owen
(That one was already fixed last week, and so should be updated when the site updates for 1.3.1.) On Fri, Apr 3, 2015 at 4:59 AM, Michael Armbrust mich...@databricks.com wrote: Looks like a typo, try: df.select(df(name), df(age) + 1) Or df.select(name, age) PRs to fix docs are always

Re: maven compile error

2015-04-03 Thread Sean Owen
If you're asking about a compile error, you should include the command you used to compile. I am able to compile branch 1.2 successfully with mvn -DskipTests clean package. This error is actually an error from scalac, not a compile error from the code. It sort of sounds like it has not been able

Spark Job Failed - Class not serializable

2015-04-03 Thread ๏̯͡๏
My Spark Job failed with 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
Akhil, you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar. How come you got this lib into the spark/lib folder? 1) Did you place it there? 2) What is the download location? On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote: Started the spark shell with the one

Re: maven compile error

2015-04-03 Thread myelinji
Thank you for your reply. When I'm using maven to compile the whole project, the errors are as follows: [INFO] Spark Project Parent POM .. SUCCESS [4.136s] [INFO] Spark Project Networking .. SUCCESS [7.405s] [INFO] Spark Project Shuffle Streaming Service

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Hi Mohammed, Not sure if you have tried this or not. You could try using the below api to start the thriftserver with an existing context. https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42 The one

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan Not sure if you have seen this or not, but here is a good example by Jonathan Lacefield of Datastax on hooking up Spark SQL with DSE; adding Tableau is as simple as Mohammed stated with DSE. https://github.com/jlacefie/sparksqltest. HTH, Todd On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist

Re: Matei Zaharia: Reddit Ask Me Anything

2015-04-03 Thread ben lorica
Happening right now https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matei-Zaharai-Reddit-Ask-Me-Anything-tp22364p22369.html Sent from the Apache Spark User List

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Very good question! This is because the current code is written such that the UI considers a batch as waiting only when it has actually started being processed. That is, batches waiting in the job queue are not considered in the calculation. It is arguable that it may be more intuitive to count that

MLlib: save models to HDFS?

2015-04-03 Thread S. Zhou
I am new to MLlib so I have a basic question: is it possible to save MLlib models (particularly CF models) to HDFS and then reload them later? If yes, could you share some sample code (I could not find it in the MLlib tutorial)? Thanks!

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Ted Yu
Maybe add another stat for batches waiting in the job queue ? Cheers On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com wrote: Very good question! This is because the current code is written such that the ui considers a batch as waiting only when it has actually started being

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Maybe that should be marked as waiting as well. We plan to update the UI soon, so we will keep that in mind. On Apr 3, 2015 10:12 AM, Ted Yu yuzhih...@gmail.com wrote: Maybe add another stat for batches waiting in the job queue ? Cheers On Fri, Apr 3, 2015 at 10:01 AM,

Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
spark-streaming-kinesis-asl is not part of the Spark distribution on your cluster, so you cannot have it be just a provided dependency. This is also why the KCL and its dependencies were not included in the assembly (but yes, they should be). ~ Jonathan Kelly From: Vadim Bichutskiy

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Removed provided and got the following error: [error] (*:assembly) deduplicate: different file contents found in the following: [error] /Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class [error]

Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, Thanks! I'll add the JIRA. I'll also try to work on a patch this weekend . - -- Ankur Chauhan On 03/04/2015 13:23, Tim Chen wrote: Hi Ankur, There isn't a way to do that yet, but it's simple to add. Can you create a JIRA in Spark for

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan, So it's been a couple of months since I have had a chance to do anything with Zeppelin, but here is a link to a post on what I did to get it working https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI. This may or may not work with the newer releases from Zeppelin.

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this seems to work. I don't understand why, though, because when I give spark://ip-10-241-251-232:7077 the application seems to bootstrap successfully, it just doesn't create a socket on port ? On Fri, Mar 27, 2015 at 10:55 AM, Mohit

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Thanks. So how do I fix it? On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote: spark-streaming-kinesis-asl is not part of the Spark distribution on your cluster, so you cannot have it be just a provided dependency. This is also why the KCL and its dependencies

Re: WordCount example

2015-04-03 Thread Tathagata Das
What does the Spark Standalone UI at port 8080 say about number of cores? On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com wrote: [ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process processor : 0 processor : 1 processor : 2 processor

Re: Simple but faster data streaming

2015-04-03 Thread Tathagata Das
I am afraid not. The whole point of Spark Streaming is to make it easy to do complicated processing on streaming data while interoperating with core Spark, MLlib, and SQL without the operational overhead of maintaining 4 different systems. As a slight cost of achieving that unification, there may be some

Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
Just remove provided from the end of the line where you specify the spark-streaming-kinesis-asl dependency. That will cause that package and all of its transitive dependencies (including the KCL, the AWS Java SDK libraries and other transitive dependencies) to be included in your uber jar.

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Adam Ritter
That doesn't seem like a good solution unfortunately as I would be needing this to work in a production environment. Do you know why the limitation exists for FileInputDStream in the first place? Unless I'm missing something important about how some of the internals work I don't see why this

Re: Spark + Kinesis

2015-04-03 Thread Daniil Osipov
Assembly settings have an option to exclude jars. You need something similar to: assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp => val excludes = Set("minlog-1.2.jar"); cp filter { jar => excludes(jar.data.getName) } } in your build file (may need to be

spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, I am trying to figure out if there is a way to tell the mesos scheduler in spark to isolate the workers to a set of mesos slaves that have a given attribute such as `tachyon:true`. Anyone knows if that is possible or how I could achieve such a

Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Jeetendra Gangele
Hi all, I am building a logistic regression for matching person data. Let's say two person objects are given with their attributes, and we need to find the score. That means on one side you have 10 million records, and on the other side we have 1 record; we need to tell which one matches with the highest score among 1

Re: variant record by case classes in shell fails?

2015-04-03 Thread Michael Albert
My apologies for following up on my own post, but a friend just pointed out that if I use Kryo with reference counting AND copy-and-paste, this runs. However, if I try to load the file, this fails as described below. I thought load was supposed to be equivalent? Thanks! -Mike From: Michael Albert

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
Yes, this definitely can be added. Just haven't gotten around to doing it :) There are proposals for this that you can try - https://github.com/apache/spark/pull/2765/files . Have you reviewed it at some point? On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter adamge...@gmail.com wrote: That doesn't seem

Spark Streaming FileStream Nested File Support

2015-04-03 Thread adamgerst
So after pulling my hair out for a bit trying to convert one of my standard spark jobs to streaming I found that FileInputDStream does not support nested folders (see the brief mention here http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources the fileStream method

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
[ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process processor : 0 processor : 1 processor : 2 processor : 3 processor : 4 processor : 5 processor : 6 processor : 7 On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd, Thanks for the help. So I was able to get DSE working with Tableau as per the link provided by Mohammed. Now I'm trying to figure out if I could write Spark SQL queries from Tableau and get data from DSE. My end goal is to get a web-based tool where I could write SQL queries which will

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, "hdfs path"). You can check the code examples here: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples. -Xiangrui On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote: Hello Zhou, You can look at the
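A hedged sketch of what save/reload could look like in 1.3 (the path and ALS parameters are illustrative; assumes a SparkContext `sc` and an RDD of `Rating`s named `ratings`):

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Train a CF model, persist it to HDFS, and load it back later (Spark 1.3+).
val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 10, /* lambda = */ 0.01)
model.save(sc, "hdfs:///models/als-model")   // illustrative path
val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als-model")
```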

Re: WordCount example

2015-04-03 Thread Tathagata Das
How many cores are present in the workers allocated to the standalone cluster spark://ip-10-241-251-232:7077 ? On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com wrote: If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this seems to work. I don't understand

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Thanks, Todd. It is an interesting idea; worth trying. I think the cash project is old. The tuplejump guy has created another project called CalliopeServer2, which works like a charm with BI tools that use JDBC, but unfortunately Tableau throws an error when it connects to it. Mohammed From:

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
@Todd, I had looked at it yesterday. All the dependencies explained there are added on the DSE node. Do I need to include the Spark and DSE dependencies on the Zeppelin node? I built Zeppelin with no Spark and no Hadoop. To my understanding Zeppelin will send a request to a remote master at spark://

Re: Spark + Kinesis

2015-04-03 Thread Tathagata Das
Just remove provided for spark-streaming-kinesis-asl: libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Thanks. So how do I fix it? On Fri, Apr 3, 2015 at 3:43 PM, Kelly,

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
A sort-of-hacky workaround is to use a queueStream, where you can manually create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note that this is for testing only, as queueStream does not work with driver fault recovery. TD On Fri, Apr 3, 2015 at 12:23 PM, adamgerst
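The workaround TD describes might look roughly like this (the paths, batch interval, and file type are illustrative assumptions; assumes an existing SparkContext `sc`):

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Manually build RDDs over nested folders and feed them through a queueStream.
// Testing only: queueStream does not support driver fault recovery.
val ssc = new StreamingContext(sc, Seconds(10))
val rddQueue = new mutable.Queue[RDD[String]]()
val stream = ssc.queueStream(rddQueue)
stream.print()

// Periodically (e.g. from a timer thread) enqueue an RDD built from nested dirs:
rddQueue += sc.textFile("/data/*/*/*.log")   // illustrative nested-glob path
```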

Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Tim Chen
Hi Ankur, There isn't a way to do that yet, but it's simple to add. Can you create a JIRA in Spark for this? Thanks! Tim On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan achau...@brightcove.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, I am trying to figure out if there is

Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou, You can look at the recommendation template http://templates.prediction.io/PredictionIO/template-scala-parallel-recommendation of PredictionIO. PredictionIO is built on top of Spark, and this template illustrates how you can save the ALS model to HDFS and then reload it later.

Spark TeraSort source request

2015-04-03 Thread Tom
Hi all, As we all know, Spark has set the record for sorting data, as published on: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html. Here at our group, we would love to verify these results and compare machines using this benchmark. We've spent quite some time trying to find the

Re: Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Joseph Bradley
If you can examine your data matrix and know that about 1/6 or so of the values are non-zero (so 5/6 are zeros), then it's probably worth using sparse vectors. (1/6 is a rough estimate.) There is support for L1 and L2 regularization. You can look at the guide here:
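For concreteness, the two representations in MLlib look like this (the values here are made up): a sparse vector stores only the size plus the non-zero indices and values.

```scala
import org.apache.spark.mllib.linalg.Vectors

// A 6-element vector with one non-zero: dense stores all six values,
// sparse stores only (size, indices, values).
val dense  = Vectors.dense(0.0, 0.0, 3.5, 0.0, 0.0, 0.0)
val sparse = Vectors.sparse(6, Array(2), Array(3.5))
```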

Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-03 Thread Ritesh Kumar Singh
Hi, Are there any tutorials that explain all the changelogs between Spark 0.8.0 and Spark 1.3.0, and how we can approach this issue?

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Thanks Mohammed, I was aware of Calliope, but haven't used it since the spark-cassandra-connector project got released. I was not aware of CalliopeServer2; cool, thanks for sharing that one. I would appreciate it if you could let me know how you decide to proceed with this; I can see this

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Sweet - I'll have to play with this then! :) On Fri, Apr 3, 2015 at 19:43 Reynold Xin r...@databricks.com wrote: There is already an explode function on DataFrame btw https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712 I think

Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xiangrui, I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and attached the sample code, but I could not attach the test data. I will update the bug once I find a place to host the test data. Thanks, David On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng men...@gmail.com

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712 I think something like this would work. You might need to play with the type. df.explode(arrayBufferColumn) { x => x } On Fri,

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Thanks Dean - fun hack :) On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote: A hack workaround is to use flatMap: rdd.flatMap { case (date, array) => for (x <- array) yield (date, x) } For those of you who don't know Scala, the for comprehension iterates through the

Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
This might be overkill for your needs, but the scodec parser combinator library might be useful for creating a parser. https://github.com/scodec/scodec Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe

Spark unit test fails

2015-04-03 Thread Manas Kar
Hi experts, I am trying to write unit tests for my Spark application, which fails with a javax.servlet.FilterRegistration error. I am using CDH5.3.2 Spark and below is my dependency list: val spark = 1.2.0-cdh5.3.2 val esriGeometryAPI = 1.2 val csvWriter = 1.0.0

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Frank Austin Nothaft
You’ll definitely want to use a Kryo-based serializer for Avro. We have a Kryo-based serializer that wraps the efficient Avro serializer here. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Apr 3, 2015, at 5:41 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
Thanks everyone for the inputs. I guess I will try out a custom implementation of InputFormat, but I have no idea where to start. Are there any code examples of this that might help? On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote: This might be overkill for your

Spark Memory Utilities

2015-04-03 Thread Stephen Carman
I noticed Spark has some nice memory tracking estimators in it, but they are private. We have some custom implementations of RDD and PairRDD to suit our internal needs, and it’d be fantastic if we’d be able to just leverage the memory estimates that already exist in Spark. Is there any chance

RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Hadoop TextInputFormat is a good start. It is not really that hard. You just need to implement the logic to identify the record delimiter, and think of a logical way to represent the key and value for your RecordReader. Yong From: kvi...@vt.edu Date: Fri, 3 Apr 2015 11:41:13 -0400 Subject: Re: Reading
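As a starting point short of writing a full custom format, Hadoop 2 ships a FixedLengthInputFormat that covers the simple fixed-size-record case. A sketch of using it from Spark (the record length and path are assumptions about your data; assumes an existing SparkContext `sc`):

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

// Read fixed-size binary records; each value is one record's bytes.
val conf = new org.apache.hadoop.conf.Configuration(sc.hadoopConfiguration)
FixedLengthInputFormat.setRecordLength(conf, 1024)   // assumed record size in bytes
val records = sc.newAPIHadoopFile(
  "/data/big-file.bin",                              // illustrative path
  classOf[FixedLengthInputFormat], classOf[LongWritable], classOf[BytesWritable], conf)
```

For variable-length records, subclassing FileInputFormat with your own RecordReader, as Yong describes, is the general route.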