Re: Bug in Spark SQL and Dataframes : Inferring the Schema Using Reflection?

2015-03-14 Thread Sean Owen
Yep, already fixed in master: https://github.com/apache/spark/pull/4977/files You need a '.toDF()' at the end. On Sat, Mar 14, 2015 at 6:55 PM, Dean Arnold renodino...@gmail.com wrote: Running 1.3.0 from binary install. When executing the example under the subject section from within
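For reference, a minimal spark-shell sketch of the corrected example, assuming the Person case class and people.txt path from the 1.3.0 SQL guide:

    case class Person(name: String, age: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._                    // brings toDF() into scope in 1.3
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()                                        // the call that was missing
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age >= 13").show()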

order preservation with RDDs

2015-03-14 Thread kian.ho
Hi, I was taking a look through the mllib examples in the official spark documentation and came across the following: http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2 specifically the lines: label = data.map(lambda x: x.label) features = data.map(lambda x:

Re: Bug in Streaming files?

2015-03-14 Thread Sean Owen
No I don't think that much is a bug, since newFilesOnly=false removes a constraint that otherwise exists, and that's what you see. However read the closely related: https://issues.apache.org/jira/browse/SPARK-6061 @tdas open question for you there. On Sat, Mar 14, 2015 at 8:18 PM, Justin Pihony

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
Hi Sean, Thanks very much for your reply. I tried to configure it with the code below: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", "62").set("spark.local.dir", "C:\\tmp") But I still get the error. Do you know how I can configure this? Thanks, Best, Peng On Sat, Mar

Bug in Streaming files?

2015-03-14 Thread Justin Pihony
All, Looking into this StackOverflow question https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469 it appears that there is a bug when utilizing the newFilesOnly parameter in FileInputDStream. Before creating a ticket, I wanted to verify it here. The gist is that this

How to create data frame from an avro file in Spark 1.3.0

2015-03-14 Thread Shing Hing Man
In spark-avro 0.1,  the method AvroContext.avroFile  returns a SchemaRDD, which is deprecated in Spark 1.3.0 package com.databricks.spark import org.apache.spark.sql.{SQLContext, SchemaRDD} package object avro {   /**    * Adds a method, `avroFile`, to SQLContext that allows reading data
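A hedged sketch of one way to get a DataFrame from Avro in 1.3, using the generic data source API rather than the deprecated SchemaRDD path (assuming a spark-avro build for 1.3 is on the classpath; the file path is made up):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // load(path, source) returns a DataFrame in Spark 1.3
    val episodes = sqlContext.load("episodes.avro", "com.databricks.spark.avro")
    episodes.printSchema()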

Bug in Spark SQL and Dataframes : Inferring the Schema Using Reflection?

2015-03-14 Thread Dean Arnold
Running 1.3.0 from binary install. When executing the example under the subject section from within spark-shell, I get the following error: scala> people.registerTempTable("people") <console>:35: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
And I have 2 TB of free space on the C drive. On Sat, Mar 14, 2015 at 8:29 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi Sean, Thanks very much for your reply. I tried to configure it with the code below: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max",

Re: GraphX Snapshot Partitioning

2015-03-14 Thread Takeshi Yamamuro
Large edge partitions could cause java.lang.OutOfMemoryError, and then Spark tasks fail. FWIW, each edge partition can hold at most 2^32 edges because 64-bit vertex IDs are mapped to 32-bit ones within each partition. If #edges exceeds that limit, GraphX could throw ArrayIndexOutOfBoundsException,

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Rishi Yadav
Programmatically specifying a schema needs import org.apache.spark.sql.types._ for StructType and StructField to resolve. On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote: Yes I think this was already just fixed by: https://github.com/apache/spark/pull/4977 a .toDF() is
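A sketch of the programmatic-schema path with that import, run in spark-shell (where sc and sqlContext already exist) and assuming a simple name/age layout for people.txt:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._              // StructType, StructField, StringType, ...

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age",  IntegerType, nullable = true)))

    val rowRDD = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Row(p(0), p(1).trim.toInt))

    val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
    peopleDF.registerTempTable("people")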

Re: [GRAPHX] could not process graph with 230M edges

2015-03-14 Thread Takeshi Yamamuro
Hi, If you have heap problems in Spark/GraphX, it'd be better to split partitions into smaller ones so that each partition fits in memory. On Sat, Mar 14, 2015 at 12:09 AM, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote: Hello, I cannot process a graph with 230M edges. I cloned
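A minimal sketch of spreading the edges over more (smaller) partitions on an already-built graph; the input path, partition strategy, and partition count are assumptions to tune:

    import org.apache.spark.graphx._

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt")   // path is an assumption
    // Repartition the edges so each partition fits comfortably in executor memory
    val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, 400)
    repartitioned.edges.count()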

deploying Spark on standalone cluster

2015-03-14 Thread sara mustafa
Hi, I am trying to deploy Spark on a standalone cluster of two machines, one for the master node and one for the worker node. I have defined the two machines in the conf/slaves file and also in /etc/hosts. When I tried to run the cluster, the worker node runs but the master node fails to start and throws this

How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
Hello, I have a cluster with Spark on YARN. Currently some of its nodes are running a Spark Streaming program, so their local space is not enough to support other applications. I wonder whether it is possible to use a blacklist to avoid using these nodes when running a new Spark program?

Re: building all modules in spark by mvn

2015-03-14 Thread Sean Owen
I can't reproduce that. 'mvn package' builds everything. You're not showing additional output from Maven that would explain what it skipped and why. On Sat, Mar 14, 2015 at 12:57 AM, sequoiadb mailing-list-r...@sequoiadb.com wrote: guys, is there any easier way to build all modules by mvn ?

Streaming linear regression example question

2015-03-14 Thread Margus Roo
Hi, I am trying to understand the example provided in https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - Streaming linear regression Code: import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.mllib.linalg.Vectors import

Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hi, I read this document, http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried to build a TF-IDF model of my documents. I have a list of documents; each word is represented as an Int, and each document is listed on one line. doc_name, int1, int2... doc_name, int3, int4...

Re: spark there is no space on the disk

2015-03-14 Thread Sean Owen
It means pretty much what it says. You ran out of space on an executor (not driver), because the dir used for serialization temp files is full (not all volumes). Set spark.local.dir to something more appropriate and larger. On Sat, Mar 14, 2015 at 2:10 AM, Peng Xia sparkpeng...@gmail.com wrote:
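For illustration, a sketch (in Scala; the same key applies from PySpark) pointing the scratch space at a larger volume -- the directory path is an assumption:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("test")
      // Directory (or comma-separated list of directories) used for shuffle and
      // serialization temp files; it must live on a volume with enough free space.
      .set("spark.local.dir", "D:\\spark-tmp")
    val sc = new SparkContext(conf)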

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Ted Yu
Which release of Hadoop are you using? Can you utilize the node labels feature? See YARN-2492 and YARN-796 Cheers On Sat, Mar 14, 2015 at 1:49 AM, James alcaid1...@gmail.com wrote: Hello, I have a cluster with Spark on YARN. Currently some of its nodes are running a Spark Streaming

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
My hadoop version is 2.2.0, and my spark version is 1.2.0 2015-03-14 17:22 GMT+08:00 Ted Yu yuzhih...@gmail.com: Which release of hadoop are you using ? Can you utilize node labels feature ? See YARN-2492 and YARN-796 Cheers On Sat, Mar 14, 2015 at 1:49 AM, James alcaid1...@gmail.com

Re: Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hey, I worked it out myself :) The Vector is actually a SparseVector, so when it is written into a string, the format is (size, [indices], [values]) Simple! On Sat, Mar 14, 2015 at 6:05 PM Xi Shen davidshe...@gmail.com wrote: Hi, I read this document,
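A quick sketch confirming that string form:

    import org.apache.spark.mllib.linalg.Vectors

    // size 5, non-zero values at indices 1 and 3
    val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
    println(v)                                       // prints (5,[1,3],[2.0,4.0])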

Re: deploying Spark on standalone cluster

2015-03-14 Thread fightf...@163.com
Hi, You may want to check your Spark environment config in spark-env.sh, specifically SPARK_LOCAL_IP, and check whether you modified that value, which may default to localhost. Thanks, Sun. fightf...@163.com From: sara mustafa Date: 2015-03-14 15:13 To: user Subject: deploying

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Simon Elliston Ball
You won’t be able to use YARN labels on 2.2.0. However, you only need the labels if you want to map containers on specific hardware. In your scenario, the capacity scheduler in YARN might be the best bet. You can set up separate queues for the streaming and other jobs to protect a percentage of
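As a sketch, the streaming application can be pinned to its own capacity-scheduler queue either with spark-submit's --queue option or via configuration; the queue name here is an assumption:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-job")
      .set("spark.yarn.queue", "streaming")          // capacity-scheduler queue to submit to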

Re: serialization stackoverflow error during reduce on nested objects

2015-03-14 Thread alexis GILLAIN
I haven't registered my class in Kryo, but I don't think it would have such an impact on the stack size. I'm thinking of using GraphX and I'm wondering how it serializes the graph object, as it can use Kryo as the serializer. 2015-03-14 6:22 GMT+01:00 Ted Yu yuzhih...@gmail.com: Have you registered
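For reference, registering classes with Kryo looks roughly like this (a sketch; the class names are hypothetical, and as noted above registration alone is unlikely to change the serialization stack depth):

    import org.apache.spark.SparkConf

    case class Leaf(value: Int)
    case class Node(children: Seq[Leaf])

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Node], classOf[Leaf]))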

Re: Using rdd methods with Dstream

2015-03-14 Thread Laeeq Ahmed
Thanks TD, this is what I was looking for. rdd.context.makeRDD worked. Laeeq On Friday, March 13, 2015 11:08 PM, Tathagata Das t...@databricks.com wrote: Is the number of top K elements you want to keep small? That is, is K small? In which case, you can 1. either do it in the
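A sketch of that pattern inside transform(), assuming a DStream of numeric values and a small K:

    import org.apache.spark.streaming.dstream.DStream

    def topKPerBatch(stream: DStream[Int], k: Int = 10): DStream[Int] =
      stream.transform { rdd =>
        val top = rdd.top(k)               // small array, collected on the driver
        rdd.context.makeRDD(top)           // turn it back into an RDD for downstream ops
      }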

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Ted Yu
Out of curiosity, a search for 'capacity scheduler deadlock' yielded the following: [YARN-3265] CapacityScheduler deadlock when computing absolute max avail capacity (fix for trunk/branch-2) [YARN-3251] Fix CapacityScheduler deadlock when computing absolute max avail capacity (short term fix

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Sean Owen
Yes I think this was already just fixed by: https://github.com/apache/spark/pull/4977 a .toDF() is missing On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath nick.pentre...@gmail.com wrote: I've found people.toDF gives you a data frame (roughly equivalent to the previous Row RDD), And you can

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
It’s a long story but there are many dirs with smallish part- files in them so we create a list of the individual files as input to sparkContext.textFile(fileList). I suppose we could move them and rename them to be contiguous part- files in one dir. Would that be better than passing
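A sketch of both options in spark-shell, with made-up paths: sc.textFile accepts a comma-separated list of paths or globs, and sc.wholeTextFiles can be handier when the files are tiny and numerous:

    // Option 1: pass the individual part- files as one comma-delimited string
    val fileList = Seq("hdfs:///data/2015-03-13/part-00000",
                       "hdfs:///data/2015-03-14/part-00000")
    val lines = sc.textFile(fileList.mkString(","))

    // Option 2: read whole files as (path, content) pairs, then split lines yourself
    val files = sc.wholeTextFiles("hdfs:///data/*/part-*")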

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Nick Pentreath
I've found people.toDF gives you a data frame (roughly equivalent to the previous Row RDD), and you can then call registerTempTable on that DataFrame. So people.toDF.registerTempTable("people") should work — Sent from Mailbox On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote: We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The

Re: How does Spark honor data locality when allocating computing resources for an application

2015-03-14 Thread eric wong
You seem not to have noticed the configuration variable spreadOutApps and its comment: // As a temporary workaround before better ways of configuring memory, we allow users to set // a flag that will perform round-robin scheduling across the nodes (spreading out each app // among all the

Spark Release 1.3.0 DataFrame API

2015-03-14 Thread David Mitchell
I am pleased with the release of the DataFrame API. However, I started playing with it, and neither of the two main examples in the documentation works: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html Specifically: - Inferring the Schema Using Reflection - Programmatically

Pausing/throttling spark/spark-streaming application

2015-03-14 Thread tulinski
Hi, I created a question on StackOverflow: http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application I would appreciate your help. Best, Tomek

Spark and HBase join issue

2015-03-14 Thread francexo83
Hi all, I have the following cluster configurations: - 5 nodes on a cloud environment. - Hadoop 2.5.0. - HBase 0.98.6. - Spark 1.2.0. - 8 cores and 16 GB of ram on each host. - 1 NFS disk with 300 IOPS mounted on host 1 and 2. - 1 NFS disk with 300 IOPS mounted on host

Re: Spark and HBase join issue

2015-03-14 Thread Ted Yu
The 4.1 GB table has 3 regions. This means that there would be at least 2 nodes which don't carry its region. Can you split this table into 12 (or more) regions ? BTW what's the value for spark.yarn.executor.memoryOverhead ? Cheers On Sat, Mar 14, 2015 at 10:52 AM, francexo83
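If that overhead isn't set, a hedged sketch of raising it (the value is an assumption; in Spark 1.2 the property is in megabytes and can also be passed with --conf at submit time):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // extra off-heap room YARN grants each executor container, in MB
      .set("spark.yarn.executor.memoryOverhead", "1024")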

Re: Need Advice about reading lots of text files

2015-03-14 Thread Michael Armbrust
Here is how I have dealt with many small text files (on s3 though this should generalize) in the past: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E

Re: Spark SQL 1.3 max operation giving wrong results

2015-03-14 Thread Michael Armbrust
Do you have an example that reproduces the issue? On Fri, Mar 13, 2015 at 4:12 PM, gtinside gtins...@gmail.com wrote: Hi, I am playing around with Spark SQL 1.3 and noticed that the max function does not give the correct result, i.e. it doesn't give the maximum value. The same query works fine in