Re: Add StructType column to SchemaRDD

2015-01-05 Thread Michael Armbrust
The types expected by applySchema are documented in the type reference section: http://spark.apache.org/docs/latest/sql-programming-guide.html#spark-sql-datatype-reference I'd certainly accept a PR to improve the docs and add a link to this from the applySchema section :) Can you explain why you
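For reference, a minimal sketch of the applySchema pattern that type reference describes, assuming an existing SparkContext sc and SQLContext sqlContext; the id/name/age layout and file path are illustrative, not from the thread:

    import org.apache.spark.sql._

    // Build the schema from the Catalyst types listed in the type reference
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // Each Row must line up with the schema, field by field, using the Scala
    // value types that the reference table maps to each DataType
    val rowRDD = sc.textFile("people.csv")
      .map(_.split(","))
      .map(p => Row(p(0), p(1), p(2).trim.toInt))

    val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
    peopleSchemaRDD.registerTempTable("people")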

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi Michael, On Tue, Jan 6, 2015 at 3:43 PM, Michael Armbrust mich...@databricks.com wrote: Oh sorry, I'm rereading your email more carefully. It's only because you have some setup code that you want to amortize? Yes, exactly that. Concerning the docs, I'd be happy to contribute, but I don't

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Michael Armbrust
Oh sorry, I'm rereading your email more carefully. It's only because you have some setup code that you want to amortize? On Mon, Jan 5, 2015 at 10:40 PM, Michael Armbrust mich...@databricks.com wrote: The types expected by applySchema are documented in the type reference section:

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-05 Thread Simon Chan
PredictionIO comes with a event server that handles data collection: http://docs.prediction.io/datacollection/overview/ It's based on HBase, which works fine with Spark as the data store of the event/training data. You probably need a separate CRUD-supported database for your application. Your

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Thanks Zhan, I'm also confused about the jstack output, why the driver gets stuck at org.apache.spark.SparkContext.clean ? On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang zzh...@hortonworks.com wrote: I think it is overflow. The training data is quite big. The algorithms scalability highly

TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-05 Thread Priya Ch
Hi All, the Word2Vec and TF-IDF algorithms in Spark mllib-1.1.0 work only in local mode, not in distributed mode; a null pointer exception is thrown. Is this a bug in spark-1.1.0? *Following is the code:* def main(args:Array[String]) { val conf=new SparkConf val sc=new
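For comparison, the usual wiring of TF-IDF in mllib-1.1.0 looks roughly like the sketch below; the input path is illustrative, and this is not claimed to reproduce or avoid the reported NPE:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // one whitespace-tokenised document per line
    val documents: RDD[Seq[String]] =
      sc.textFile("hdfs:///data/docs.txt").map(_.split(" ").toSeq)

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()                        // IDF makes a second pass over the term frequencies

    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)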

Re: spark.akka.frameSize limit error

2015-01-05 Thread Saeed Shahrivari
Hi, Thanks for the prompt reply. I checked the code. The main issue is the large number of mappers. If the number of mappers is set to some number around 1000, there will be no problem. I hope the bug gets fixed in the next releases. On Mon, Jan 5, 2015 at 1:26 AM, Josh Rosen
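For anyone hitting the same limit before a fix lands, the frame size itself is configurable; a hedged example (128 MB is an arbitrary illustrative value, not a recommendation):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("many-mappers-job")            // illustrative app name
      .set("spark.akka.frameSize", "128")        // value is in MB; the 1.x default is 10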

stopping streaming context

2015-01-05 Thread Hafiz Mujadid
Hi experts! Is there any way to stop the Spark streaming context once 5 batches are completed? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stopping-streaming-context-tp20970.html Sent from the Apache Spark User List mailing list archive at

Finding most occurrences in a JSON Nested Array

2015-01-05 Thread adstan
Hi, I'm pretty new to both Spark (and Scala), so I would like to seek some help here: I have this dataset in JSON: In short, I'm trying to find out the top 10 hobbies, sorted DESC by count. So basically I did: Prints... This is where I got stuck... I tried and got: What do I do with an

spark 1.2: value toJSON is not a member of org.apache.spark.sql.SchemaRDD

2015-01-05 Thread bchazalet
Hi everyone, I have just switched to spark 1.2.0 from 1.1.1, updating my sbt to point to the 1.2.0 jars. "org.apache.spark" %% "spark-core" % "1.2.0", "org.apache.spark" %% "spark-sql" % "1.2.0", "org.apache.spark" %% "spark-hive" % "1.2.0", I was hoping to use

Re: a vague question, but perhaps it might ring a bell

2015-01-05 Thread Michael Albert
Greeting! Thank you very much for taking the time to respond. My apologies, but at the moment I don't have an example that I feel comfortable posting. Frankly, I've been struggling with variants of this for the last two weeks and probably won't be able to work on this particular issue for a few

Re: FlatMapValues

2015-01-05 Thread Sean Owen
For the record, the solution I was suggesting was about like this: inputRDD.flatMap { input => val tokens = input.split(',') val id = tokens(0) val keyValuePairs = tokens.tail.grouped(2) val keys = keyValuePairs.map(_(0)) keys.map(key => (id, key)) } This is much more efficient. On Wed,
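Reformatted into a self-contained snippet (the sample input line is invented for illustration), the suggestion is:

    // "id1,key1,val1,key2,val2"  becomes  (id1,key1), (id1,key2)
    val inputRDD = sc.parallelize(Seq("id1,key1,val1,key2,val2"))

    val pairs = inputRDD.flatMap { input =>
      val tokens = input.split(',')
      val id = tokens(0)
      val keyValuePairs = tokens.tail.grouped(2)   // (key1,val1), (key2,val2)
      val keys = keyValuePairs.map(_(0))
      keys.map(key => (id, key))
    }

    pairs.collect().foreach(println)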

Re: stopping streaming context

2015-01-05 Thread Akhil Das
Have a look at StreamingListener http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener, Here's an example http://stackoverflow.com/questions/20950268/how-to-stop-spark-streaming-context-when-the-network-connection-tcp-ip-is-clos on how to
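A rough sketch of that approach for the "stop after 5 batches" case, assuming an existing StreamingContext ssc; calling stop() directly from the listener callback can block the listener bus, so it is done from a separate thread here, and the class and thread names are made up:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class StopAfterNBatches(ssc: StreamingContext, n: Int) extends StreamingListener {
      private val completed = new AtomicInteger(0)
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        if (completed.incrementAndGet() == n) {
          // stop asynchronously so the listener thread itself is not blocked
          new Thread("stop-streaming-context") {
            override def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = true)
          }.start()
        }
      }
    }

    ssc.addStreamingListener(new StopAfterNBatches(ssc, 5))
    ssc.start()
    ssc.awaitTermination()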

Implement customized Join for SparkSQL

2015-01-05 Thread Dai, Kevin
Hi, All Suppose I want to join two tables A and B as follows: Select * from A join B on A.id = B.id A is a file, while B is a database which is indexed by id and which I wrapped with the Data Source API. The desired join flow is: 1. Generate A's RDD[Row] 2. Generate B's RDD[Row] from A by

Override hostname of Spark executors

2015-01-05 Thread christian_t
Hi all, I'm currently setting up a dev environment for Spark. We are planning to use a commercial, hostname-dongled 3rd-party library in our Spark jobs. The question which arises now: a) Is it possible (maybe on job level) to tell the Spark executor which hostname it should report to the

Re: Mesos resource allocation

2015-01-05 Thread Josh Devins
Hey Tim, sorry for the delayed reply, been on vacation for a couple weeks. The reason we want to control the number of executors is that running executors with JVM heaps over 30GB causes significant garbage collection problems. We have observed this through much trial-and-error for jobs that are

RE: Implement customized Join for SparkSQL

2015-01-05 Thread Cheng, Hao
Can you paste the error log? From: Dai, Kevin [mailto:yun...@ebay.com] Sent: Monday, January 5, 2015 6:29 PM To: user@spark.apache.org Subject: Implement customized Join for SparkSQL Hi, All Suppose I want to join two tables A and B as follows: Select * from A join B on A.id = B.id A is a

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-05 Thread Sean Owen
In the first instance, I'm suggesting that ALS in Spark could perhaps expose a run() method that accepts a previous MatrixFactorizationModel, and uses the product factors from it as the initial state instead. If anybody seconds that idea, I'll make a PR. The second idea is just fold-in:

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-05 Thread Wouter Samaey
One other idea was that I don’t need to re-train the model, but simply pass all the current user’s recent ratings (including ones created after the training) to the existing model… Is this a valid option? Wouter Samaey Zaakvoerder Storefront BVBA Tel: +32 472 72 83 07 Web:

Re: Spark Driver behind NAT

2015-01-05 Thread Aaron
Thanks for the link! However, from reviewing the thread, it appears you cannot have a NAT/firewall between the cluster and the spark-driver/shell... is this correct? When the shell starts up, it binds to the internal IP (e.g. 192.168.x.y), not the external floating IP, which is routable from the
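One thing that may be worth trying (offered as a guess, not a verified fix for the NAT case): explicitly point the driver at the routable address and pin its port so it can be forwarded; the IP address and port below are placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("driver-behind-nat")
      .set("spark.driver.host", "203.0.113.10")   // the external/floating IP, not the 192.168.x.y one
      .set("spark.driver.port", "7001")           // pin the port so a NAT/firewall rule can forward it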

Re: Spark Driver behind NAT

2015-01-05 Thread Akhil Das
You can have a look at this discussion http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html Thanks Best Regards On Mon, Jan 5, 2015 at 6:11 PM, Aaron aarongm...@gmail.com wrote: Hello there, I was wondering if there

Spark Driver behind NAT

2015-01-05 Thread Aaron
Hello there, I was wondering if there is a way to have the spark-shell (or pyspark) sit behind a NAT when talking to the cluster? Basically, we have OpenStack instances that run with internal IPs, and we assign floating IPs as needed. Since the workers make direct TCP connections back, the

Re: different akka versions and spark

2015-01-05 Thread Cody Koeninger
I haven't tried it with spark specifically, but I've definitely run into problems trying to depend on multiple versions of akka in one project. On Sat, Jan 3, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote: hey Ted, i am aware of the upgrade efforts for akka. however if spark 1.2

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
If you need more help let me know -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20976.html Sent from the

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
try as below results.map(row => row(1)).collect try var hobbies = results.flatMap(row => row(1)) It will create all the hobbies in a simple array. Now var hbmap = hobbies.map(hobby => (hobby,1)).reduceByKey((hobcnt1,hobcnt2) => hobcnt1+hobcnt2) It will aggregate hobbies as below {swimming,2},
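Put together, and assuming the hobbies array is the second field of each Row and comes back as a Seq[String] (that position is a guess from the thread), the whole top-10 computation might look like:

    // results is the SchemaRDD produced by sqlContext.sql(...) / jsonFile(...)
    val hobbies = results.flatMap(row => row(1).asInstanceOf[Seq[String]])

    val top10 = hobbies
      .map(hobby => (hobby, 1))
      .reduceByKey(_ + _)                          // e.g. (swimming, 2), (reading, 5), ...
      .map { case (hobby, count) => (count, hobby) }
      .sortByKey(ascending = false)                // sort DESC by count
      .take(10)

    top10.foreach { case (count, hobby) => println(s"$hobby: $count") }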

Add PredictionIO to Powered by Spark

2015-01-05 Thread Thomas Stone
Please can we add PredictionIO to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark PredictionIO http://prediction.io/ PredictionIO is an open source machine learning server for software developers to easily build and deploy predictive applications on production.

Re: different akka versions and spark

2015-01-05 Thread Koert Kuipers
Since spark shaded akka I wonder if it would work, but I doubt it. On Mon, Jan 5, 2015 at 9:56 AM, Cody Koeninger c...@koeninger.org wrote: I haven't tried it with spark specifically, but I've definitely run into problems trying to depend on multiple versions of akka in one project. On Sat,

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread adstan
I did try this earlier, but I’ve got an error that I couldn’t comprehend: scala> var hobbies = results.flatMap(row => row(1)) <console>:16: error: type mismatch; found : Any required: TraversableOnce[?] var hobbies = results.flatMap(row => row(1)) I must be missing something,

Saving partial (top 10) DStream windows to hdfs

2015-01-05 Thread Laeeq Ahmed
Hi, I am counting values in each window to find the top values, and I want to save only the 10 most frequent values of each window to HDFS rather than all the values. eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)val
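One hedged way to do that (the output path and the shape of the counted stream are assumptions): keep the windowed counting as it is, then use foreachRDD with top() on each window's RDD and write only those records:

    // counts: DStream[(String, Long)] produced by the windowed reduceByKeyAndWindow step
    counts.foreachRDD { (rdd, time) =>
      val top10 = rdd.top(10)(Ordering.by { case (_, count) => count })
      // write one small file per window instead of the full window contents
      rdd.sparkContext.parallelize(top10, 1)
        .saveAsTextFile(s"hdfs:///output/top10-${time.milliseconds}")
    }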

RE: python API for gradient boosting?

2015-01-05 Thread Christopher Thom
Awesome, thanks for creating this -- I wasn't sure of the process for requesting such a thing. I looked at what is required based on the scala code, and it seemed a bit beyond my capability. cheers chris -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Tuesday, 6

Re: MLLIB and Openblas library in non-default dir

2015-01-05 Thread Xiangrui Meng
It might be hard to do that with spark-submit, because the executor JVMs may be already up and running before a user runs spark-submit. You can try to use `System.setProperty` to change the property at runtime, though it doesn't seem to be a good solution. -Xiangrui On Fri, Jan 2, 2015 at 6:28

Re: python API for gradient boosting?

2015-01-05 Thread Xiangrui Meng
I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone would work on it and make it available in the 1.3 release. -Xiangrui On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom christopher.t...@quantium.com.au wrote: Hi, I wonder if anyone knows when a

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Xiangrui Meng
How big is your dataset, and what is the vocabulary size? -Xiangrui On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi, When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU usage. Here is the jstack output: "main" prio=10 tid=0x40112800

Spark response times for queries seem slow

2015-01-05 Thread Sam Flint
I am running a pyspark job over 4GB of data that is split into 17 parquet files on an HDFS cluster. This is all in Cloudera Manager. Here is the query the job is running: parquetFile.registerTempTable("parquetFileone") results = sqlContext.sql("SELECT sum(total_impressions), sum(total_clicks) FROM

Re: different akka versions and spark

2015-01-05 Thread Marcelo Vanzin
Spark doesn't really shade akka; it pulls a different build (kept under the org.spark-project.akka group and, I assume, with some build-time differences from upstream akka?), but all classes are still in the original location. The upgrade is a little more unfortunate than just changing akka,

Re: Shuffle Problems in 1.2.0

2015-01-05 Thread Sven Krasser
Thanks for the input! I've managed to come up with a repro of the error with test data only (and without any of the custom code in the original script), please see here: https://gist.github.com/skrasser/4bd7b41550988c8f6071#file-gistfile1-md The Gist contains a data generator and the script

Re: Reading from a centralized stored

2015-01-05 Thread Cody Koeninger
If you are not co-locating spark executor processes on the same machines where the data is stored, and using an rdd that knows about which node to prefer scheduling a task on, yes, the data will be pulled over the network. Of the options you listed, S3 and DynamoDB cannot have spark running on

Re: Reading from a centralized stored

2015-01-05 Thread Franc Carter
Thanks, that's what I suspected. cheers On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org wrote: If you are not co-locating spark executor processes on the same machines where the data is stored, and using an rdd that knows about which node to prefer scheduling a task on,

Re: Spark response times for queries seem slow

2015-01-05 Thread Cody Koeninger
That sounds slow to me. It looks like your sql query is grouping by a column that isn't in the projections, I'm a little surprised that even works. But you're getting the same time reducing manually? Have you looked at the shuffle amounts in the UI for the job? Are you certain there aren't a

Re: SparkSQL support for reading Avro files

2015-01-05 Thread Michael Armbrust
Did you follow the link on that page? THIS REPO HAS BEEN MOVED (https://github.com/marmbrus/sql-avro#please-go-to-the-version-hosted-by-databricks). Please go to the version hosted by Databricks: https://github.com/databricks/spark-avro On Mon, Jan 5, 2015 at 1:12 PM, yanenli2 yane...@gmail.com

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-05 Thread Simon Chan
Alec, If you are looking for a Machine Learning stack that supports business-logics, you may take a look at PredictionIO: http://prediction.io/ It's based on Spark and HBase. Simon On Mon, Jan 5, 2015 at 6:14 PM, Alec Taylor alec.tayl...@gmail.com wrote: Thanks all. To answer your

Parquet predicate pushdown

2015-01-05 Thread Adam Gilmore
Hi all, I have a question regarding predicate pushdown for Parquet. My understanding was this would use the metadata in Parquet's blocks/pages to skip entire chunks that won't match without needing to decode the values and filter on every value in the table. I was testing a scenario where I had

Reading from a centralized stored

2015-01-05 Thread Franc Carter
Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store ? thanks -- *Franc Carter* | Systems Architect |

Fwd: Controlling number of executors on Mesos vs YARN

2015-01-05 Thread Tim Chen
Forgot to hit reply-all. -- Forwarded message -- From: Tim Chen t...@mesosphere.io Date: Sun, Jan 4, 2015 at 10:46 PM Subject: Re: Controlling number of executors on Mesos vs YARN To: mvle m...@us.ibm.com Hi Mike, You're correct there is no such setting in for Mesos coarse

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-05 Thread Alec Taylor
Thanks all. To answer your clarification questions: - I'm writing this in Python - A similar problem to my actual one is to find common 30 minute slots (over the next 12 months) [r] that k users have in common. Total users: n. Given n=1 and r=17472 then the [naïve] time-complexity is

Re: Parquet predicate pushdown

2015-01-05 Thread Michael Armbrust
Predicate push down into the input format is turned off by default because there is a bug in the current parquet library that throws null pointer exceptions when there are full row groups that are null. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want:
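i.e. set the flag before running the query; the key is spark.sql.parquet.filterPushdown, and either of the following is one way of setting it:

    // Spark 1.2: pushdown into the Parquet input format, off by default due to SPARK-4258
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // or, equivalently, from SQL
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")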

Re: Mesos resource allocation

2015-01-05 Thread Tim Chen
Hi Josh, I see, I haven't heard folks using larger JVM heap size than you mentioned (30gb), but in your scenario what you're proposing does make sense. I've created SPARK-5095 and we can continue our discussion about how to address this. Tim On Mon, Jan 5, 2015 at 1:22 AM, Josh Devins

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-05 Thread Alec Taylor
Thanks Simon, that's a good way to train on incoming events (and related problems / and result computations). However, does it handle the actual data storage? - E.g.: CRUD documents On Tue, Jan 6, 2015 at 1:18 PM, Simon Chan simonc...@gmail.com wrote: Alec, If you are looking for a Machine

Add StructType column to SchemaRDD

2015-01-05 Thread Tobias Pfeiffer
Hi, I have a SchemaRDD where I want to add a column with a value that is computed from the rest of the row. As the computation involves a network operation and requires setup code, I can't use SELECT *, myUDF(*) FROM rdd, but I wanted to use a combination of: - get schema of input SchemaRDD
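A sketch of that combination, with the per-partition setup amortised via mapPartitions; the LookupClient, the new column name, the registered table name "rdd" and the assumption that the first column is a string id are all illustrative, not from the thread:

    import org.apache.spark.sql._

    // stand-in for the expensive-to-set-up network client described above
    class LookupClient extends Serializable {
      def lookup(id: String): String = "value-for-" + id
    }

    val input = sqlContext.sql("SELECT * FROM rdd")          // the input SchemaRDD

    // 1. extend the input schema with the extra column
    val newSchema = StructType(input.schema.fields :+ StructField("computed", StringType, nullable = true))

    // 2. compute the new value per row, creating the client once per partition
    val newRows = input.mapPartitions { rows =>
      val client = new LookupClient()                        // setup code runs once per partition
      rows.map(row => Row((row.toSeq :+ client.lookup(row.getString(0))): _*))
    }

    // 3. reapply the extended schema to get a SchemaRDD with the new column
    val withColumn = sqlContext.applySchema(newRows, newSchema)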

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
Yes, row(1).collect would be wrong as it is not a transformation on an RDD; try getString(1) to fetch the value from the RDD. I already said this is pseudo code. If it does not help let me know, I will run the code and send it to you. get/getAs should work for you, for example var hashTagsList =

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Hi Xiangrui, Our dataset is about 80GB (10B lines). In the driver's log, we found this: *INFO Word2Vec: trainWordsCount = -1610413239* it seems that there is an integer overflow? On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng men...@gmail.com wrote: How big is your dataset, and what is the
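That negative value is consistent with a plain Int counter wrapping around once the total word count passes Int.MaxValue; a tiny illustration of the wrap-around (the numbers are arbitrary):

    var wordCount = 0                            // an Int counter, which tops out at 2147483647
    for (_ <- 1 to 5) wordCount += 500000000     // 2.5 billion words in total
    println(wordCount)                           // prints -1794967296: the count has wrapped negative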

Re: JdbcRdd for Python

2015-01-05 Thread Michael Armbrust
I'll add that there is a JDBC connector for the Spark SQL data sources API in the works, and this will work with python (through the standard SchemaRDD type conversions). On Mon, Jan 5, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org wrote: JavaDataBaseConnectivity is, as far as I know, JVM

Spark avro: Sample app fails to run in a spark standalone cluster

2015-01-05 Thread Niranda Perera
Hi, I have this simple spark app. public class AvroSparkTest { public static void main(String[] args) throws Exception { SparkConf sparkConf = new SparkConf() .setMaster("spark://niranda-ThinkPad-T540p:7077") // ("local[2]") .setAppName("avro-spark-test");

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Zhan Zhang
I think it is overflow. The training data is quite big. The algorithms scalability highly depends on the vocabSize. Even without overflow, there are still other bottlenecks, for example, syn0Global and syn1Global, each of them has vocabSize * vectorSize elements. Thanks. Zhan Zhang On Jan

Re: spark 1.2: value toJSON is not a member of org.apache.spark.sql.SchemaRDD

2015-01-05 Thread Michael Armbrust
I think you are missing something: $ javap -cp ~/Downloads/spark-sql_2.10-1.2.0.jar org.apache.spark.sql.SchemaRDD | grep toJSON public org.apache.spark.rdd.RDD<java.lang.String> toJSON(); On Mon, Jan 5, 2015 at 3:11 AM, bchazalet bchaza...@companywatch.net wrote: Hi everyone, I have just
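Once the 1.2.0 jar is really the one on the classpath, usage is just the following (the JSON path is the one from the Spark examples, used here only for illustration):

    val people = sqlContext.jsonFile("examples/src/main/resources/people.json")
    val asJson: org.apache.spark.rdd.RDD[String] = people.toJSON
    asJson.take(2).foreach(println)              // each element is one row rendered as a JSON string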

Custom receiver runtime Kryo exception

2015-01-05 Thread contractor
Hello all, I am using Spark 1.0.2 and I have a custom receiver that works well. I tried adding Kryo serialization to SparkConf: val spark = new SparkConf() ….. .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") and I am getting a strange error that I am not sure how to
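If the exception turns out to be about an unregistered class coming through the receiver, one hedged thing to try on 1.0.2 is a custom KryoRegistrator; the event class and the package name below are placeholders:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyEvent(id: Long, payload: String)            // stand-in for whatever the receiver emits

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyEvent])
      }
    }

    val spark = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")   // fully-qualified registrator class name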

Question on Spark UI/accumulators

2015-01-05 Thread Virgil Palanciuc
Hi, The Spark documentation states that If accumulators are created with a name, they will be displayed in Spark’s UI http://spark.apache.org/docs/latest/programming-guide.html#accumulators Where exactly are they shown? I may be dense, but I can't find them on the UI from http://localhost:4040
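For reference, a named accumulator is created by passing the name as the second argument; if memory serves, it then shows up on the stage detail pages under :4040/stages rather than on a page of its own:

    // the second argument is the display name used by the web UI
    val accum = sc.accumulator(0, "My Counter")
    sc.parallelize(1 to 10000).foreach(x => accum += 1)
    println(accum.value)                         // 10000; check the stage's detail page while the app is running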

python API for gradient boosting?

2015-01-05 Thread Christopher Thom
Hi, I wonder if anyone knows when a python API will be added for Gradient Boosted Trees? I see that java and scala APIs were added for the 1.2 release, and would love to be able to build GBMs in pyspark too. I find the convenience of being able to use numpy + scipy + matplotlib pretty

Re: FlatMapValues

2015-01-05 Thread Sanjay Subramanian
cool let me adapt that. thanks a ton. regards, sanjay From: Sean Owen so...@cloudera.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: user@spark.apache.org user@spark.apache.org Sent: Monday, January 5, 2015 3:19 AM Subject: Re: FlatMapValues For the record, the solution I was

problem while running code

2015-01-05 Thread shahid
the log is here py4j.protocol.Py4JError: An error occurred while calling o22.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at

Timeout Exception in standalone cluster

2015-01-05 Thread rajnish
Hi, I am getting following exception in Spark (1.1.0) Job that is running on Standalone Cluster. My cluster configuration is: Intel(R) 2.50GHz 4 Core 16 GB RAM 5 Machines. Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at

Re: Api to get the status of spark workers

2015-01-05 Thread rajnish
You can use 4040 port, that gives information for current running application. That will give detail summary of currently running executors. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Api-to-get-the-status-of-spark-workers-tp20967p20980.html Sent from

Fwd: ApacheCon North America 2015 Call For Papers

2015-01-05 Thread Matei Zaharia
FYI, ApacheCon North America call for papers is up. Matei Begin forwarded message: Date: January 5, 2015 at 9:40:41 AM PST From: Rich Bowen rbo...@rcbowen.com Reply-To: dev d...@community.apache.org To: dev d...@community.apache.org Subject: ApacheCon North America 2015 Call For Papers

Re: JdbcRdd for Python

2015-01-05 Thread Cody Koeninger
JavaDataBaseConnectivity is, as far as I know, JVM specific. The JdbcRDD is expecting to deal with Jdbc Connection and ResultSet objects. I haven't done any python development in over a decade, but if someone wants to work together on a python equivalent I'd be happy to help out. The original

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-01-05 Thread Sam Flint
Below is the code that I am running. I get an error for unresolved attributes. Can anyone point me in the right direction? Running from pyspark shell using yarn MASTER=yarn-client pyspark Error is below code: # Import SQLContext and data types from pyspark.sql import * # sc is an existing

SparkSQL support for reading Avro files

2015-01-05 Thread yanenli2
Hi All, I want to use SparkSQL to manipulate data in Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with the latest code of Spark version 1.2.0 or 1.2.1. I then try to pull a copy from github stated at