RE: SparkSQL can not use SchemaRDD from Hive

2014-07-29 Thread Cheng, Hao
In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD binds to a certain SQLContext at runtime; I don't think we can manipulate/share a SchemaRDD across SQLContext instances. -Original Message- From: Kevin Jung [mailto:itsjb.j...@samsung.com] Sent: Tuesday, July

Joining spark user group

2014-07-29 Thread jitendra shelar

Re: SparkSQL can not use SchemaRDD from Hive

2014-07-29 Thread Zongheng Yang
As Hao already mentioned, using 'hive' (the HiveContext) throughout would work. On Monday, July 28, 2014, Cheng, Hao hao.ch...@intel.com wrote: In your code snippet, sample is actually a SchemaRDD, and SchemaRDD actually binds a certain SQLContext in runtime, I don't think we can
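
A minimal sketch of the suggested fix, assuming the Spark 1.0-era API (hql and registerAsTable) and an existing Hive table named src: create a single HiveContext and run every query and table registration through it, so each SchemaRDD stays bound to the same context. Table and query names are illustrative.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext("local", "hive-only")
    val hive = new HiveContext(sc)

    // Every SchemaRDD below is bound to the single `hive` context.
    val sample = hive.hql("SELECT * FROM src LIMIT 10")
    sample.registerAsTable("sample")  // registered in that same context
    hive.hql("SELECT COUNT(*) FROM sample").collect().foreach(println)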

Re: How true is this about spark streaming?

2014-07-29 Thread Sean Owen
I'm not sure I understand this, maybe because the context is missing. An RDD is immutable, so there is no such thing as writing to an RDD. I'm not sure which aspect is being referred to as single-threaded. Is this the Spark Streaming driver? What is the difference between streaming into Spark and
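
For reference, a two-line illustration of the immutability point (assuming an existing SparkContext sc; the values are arbitrary):

    val base = sc.parallelize(Seq(1, 2, 3))
    val doubled = base.map(_ * 2)  // returns a new RDD; `base` itself never changes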

SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread John Wu
Hi all, there is a problem we can't resolve. We implemented the OWLQN algorithm in parallel with Spark, and we don't know why it is so slow in every iteration stage; the CPU and memory load on each executor is so low that it seems impossible for that to make every step slow. And

Re: UpdatestateByKey assumptions

2014-07-29 Thread RodrigoB
Just an update on this: looking into the Spark logs, it seems that some partitions are not found and get recomputed, which gives the impression that those are related to the delayed updateStateByKey calls. I'm seeing something like: log line 1 - Partition rdd_132_1 not found, computing it log line N -

Avro Schema + GenericRecord to HadoopRDD

2014-07-29 Thread Laird, Benjamin
Hi all, I can read Avro files into Spark with HadoopRDD and submit the schema in the jobConf, but with the guidance I've seen so far, I'm left with an Avro GenericRecord of untyped Java objects. How do I actually use the schema to have the types inferred? Example: scala

Unit Testing (JUnit) with Spark

2014-07-29 Thread soumick86
Is there any example out there for unit testing a Spark application in Java? Even a trivial application like word count would be very helpful. I am very new to this and I am struggling to understand how I can use JavaSparkContext for JUnit.

Job using Spark for Machine Learning

2014-07-29 Thread Martin Goodson
I'm not sure if job adverts are allowed on here - please let me know if not. Otherwise, if you're interested in using Spark in an R&D machine learning project then please get in touch. We are a startup based in London. Our data sets are on a massive scale: we collect data on over a billion users

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
I've been doing some work on building Spark blueprints, and recently tried to generalize one into an easy blueprint for Spark apps. https://github.com/jayunit100/SparkBlueprint.git It runs the Spark app's main method in a unit test, and builds in SBT. You can easily try it out and improve on it.

Re: SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread Xiangrui Meng
Do you mind sharing more details, for example, specs of nodes and data size? -Xiangrui 2014-07-29 2:51 GMT-07:00 John Wu j...@zamplus.com: Hi all, There is a problem we can’t resolve. We implement the OWLQN algorithm in parallel with SPARK, We don’t know why It is very slow in every

the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Denis RP
Hi, I'm running a Spark standalone cluster to calculate single-source shortest paths. Here is the code: VertexRDD[(String, Long)], with String for the path and Long for the distance. (The code before these lines relates to reading the graph data from a file and building the graph.) 71 val sssp =
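
For comparison, here is a minimal single-source-shortest-paths sketch with the Pregel operator, adapted from the standard GraphX example. It assumes a graph with Double edge attributes and a hypothetical sourceId (unlike the (String, Long) vertex attributes in this thread), plus an existing graph value in scope.

    import org.apache.spark.graphx._

    val sourceId: VertexId = 42L
    // Initialize every vertex to infinity except the source.
    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),  // vertex program
      triplet =>                                        // send messages along shorter paths
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                          // merge incoming messages
    )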

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Kostiantyn Kudriavtsev
Hi, try this one: http://simpletoad.blogspot.com/2014/07/runing-spark-unit-test-on-windows-7.html It's more about fixing a Windows-specific issue, but the code snippet gives the general idea: just run the ETL and check the output with Assert(s). On Jul 29, 2014, at 6:29 PM, soumick86 sdasgu...@dstsystems.com

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Sonal Goyal
You can take a look at https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java and model your junits based on it. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Jul 29, 2014 at 10:10 PM,
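
A minimal JUnit sketch along the lines of JavaAPISuite (written in Scala here; the Java version is analogous): one local context per test, stopped in tearDown so tests don't leak contexts. Class and test names are illustrative.

    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals
    import org.apache.spark.api.java.JavaSparkContext

    class WordCountTest {
      private var sc: JavaSparkContext = _

      @Before def setUp(): Unit = { sc = new JavaSparkContext("local", "WordCountTest") }
      @After def tearDown(): Unit = { sc.stop() }

      @Test def countsWords(): Unit = {
        val words = sc.parallelize(java.util.Arrays.asList("a", "b", "a"))
        assertEquals(3L, words.count())  // three elements in total
      }
    }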

Re: Memory compute-intensive tasks

2014-07-29 Thread rpandya
OK, I did figure this out. I was running the app (avocado) using spark-submit, when it was actually designed to take command line arguments to connect to a spark cluster. Since I didn't provide any such arguments, it started a nested local Spark cluster *inside* the YARN Spark executor and so of

GraphX Connected Components

2014-07-29 Thread Jeffrey Picard
Hey all, I’m currently trying to run connected components using GraphX on a large graph (~1.8b vertices and ~3b edges, most of them self-edges where the only edge that exists for vertex v is v-v) on EMR using 50 m3.xlarge nodes. As the program runs I’m seeing each iteration take longer and

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here; that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

RE: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-29 Thread nikroy16
Thanks for the response... hive-site.xml is in the classpath so that doesn't seem to be the issue.

Re: iScala or Scala-notebook

2014-07-29 Thread Nick Pentreath
IScala itself seems to be a bit dead unfortunately. I did come across this today: https://github.com/tribbloid/ISpark On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989 ericjohnston1...@gmail.com wrote: Hey everyone, I know this was asked before but I'm wondering if there have since been

python project like spark-jobserver?

2014-07-29 Thread Chris Grier
I'm looking for something like the ooyala spark-jobserver ( https://github.com/ooyala/spark-jobserver) that basically manages a SparkContext for use from a REST or web application environment, but for Python jobs instead of Scala. Has anyone written something like this? Looking for a project or

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-29 Thread Michael Armbrust
The warehouse and the metastore directories are two different things. The metastore holds the schema information about the tables and will by default be a local directory. With javax.jdo.option.ConnectionURL you can configure it to be something like MySQL. The warehouse directory is the default

Re: iScala or Scala-notebook

2014-07-29 Thread andy petrella
Some people started some work on that topic using the notebook (the original or the n8han one, I cannot remember)... Some issues have been created already ^^ On 29 Jul 2014 19:59, Nick Pentreath nick.pentre...@gmail.com wrote: IScala itself seems to be a bit dead unfortunately. I did come

Using countApproxDistinct in pyspark

2014-07-29 Thread Diederik
Heya, I would like to use countApproxDistinct in pyspark, I know that it's an experimental method and that it is not yet available in pyspark. I started with porting the countApproxDistinct unit-test to Python, see https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the results are
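
For reference, the Scala call being ported (an experimental method on RDD, assuming an existing SparkContext sc); relativeSD trades accuracy for memory in the underlying HyperLogLog estimate:

    val rdd = sc.parallelize(1 to 100000)
    val approx = rdd.countApproxDistinct(relativeSD = 0.05)  // an estimate, not exact
    println(s"approx distinct: $approx")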

Spark and Flume integration - do I understand this correctly?

2014-07-29 Thread dapooley
Hi, I am trying to integrate Spark onto a Flume log sink and Avro source. The sink is on one machine (the application), and the source is on another. Log events are being sent from the application server to the Avro source server (a log directory sink on the Avro source prints them to verify). The aim

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
Yifan LI iamyifa...@gmail.com writes: Maybe you could get the vertex whose id is 80, for instance, by using: graph.vertices.filter { case (id, _) => id == 80 }.collect but I am not sure this is exactly the efficient way. (Will it scan the whole table if it cannot benefit from an index of
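
If collecting a filter feels heavy, a hedged alternative (assuming the graph from the question is in scope): lookup() from PairRDDFunctions returns all attributes for a key, and when the RDD has a known partitioner only the partition that the key maps to is searched.

    import org.apache.spark.SparkContext._  // brings in PairRDDFunctions (lookup)

    val attrs = graph.vertices.lookup(80L)  // Seq of attributes for vertex id 80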

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
I just looked at my dependencies in sbt, and when using CDH 4.5.0 dependencies I see that hadoop-client pulls in JBoss Netty (via ZooKeeper) and ASM 3.x (via jersey-server). So somehow these exclusion rules are not working anymore? I will look into sbt-pom-reader a bit to try to understand what's

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Ankur Dave
Denis RP qq378789...@gmail.com writes: [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 6.0:4 failed 4 times, most recent failure: Exception failure in TID 598 on host worker6.local: java.lang.NullPointerException [error]

Example standalone app error!

2014-07-29 Thread Alex Minnaar
I am trying to run an example Spark standalone app with the following code: import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object SparkGensimLDA extends App { val ssc = new StreamingContext("local", "testApp", Seconds(5)) val

Re: Job using Spark for Machine Learning

2014-07-29 Thread Matei Zaharia
Hi Martin, Job ads are actually not allowed on the list, but thanks for asking. Just posting this for others' future reference. Matei On July 29, 2014 at 8:34:59 AM, Martin Goodson (mar...@skimlinks.com) wrote: I'm not sure if job adverts are allowed on here - please let me know if not. 

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default way of broadcasting. The driver holds the data and the executors request it via HTTP, making the driver the bottleneck if the data is large. -Xiangrui On Tue, Jul 29, 2014 at 10:32 AM, durin m...@simon-schaefer.net wrote: Development is really rapid
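
A sketch of how the mechanism could be selected around Spark 1.0, assuming the spark.broadcast.factory setting of that era; the app name, master, and data are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("broadcast-demo")
      .setMaster("local[2]")
      // the older HTTP mechanism described above; for the peer-to-peer one use
      // "org.apache.spark.broadcast.TorrentBroadcastFactory"
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")

    val sc = new SparkContext(conf)
    val big = sc.broadcast((1 to 1000000).toArray)  // served from the driver over HTTP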

Re: java.io.StreamCorruptedException: invalid type code: 00

2014-07-29 Thread Alexis Roos
Just realized that I was missing JavaSparkContext in the imports; after adding it, the error is: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: java.lang.reflect.Method at

RE: Avro Schema + GenericRecord to HadoopRDD

2014-07-29 Thread Severs, Chris
Hi Benjamin, I think the best bet would be to use the Avro code generation stuff to generate a SpecificRecord for your schema and then change the reader to use your specific type rather than GenericRecord. Trying to read up the generic record and then do type inference and spit out a tuple
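
A minimal sketch of that suggestion, assuming MyRecord is a hypothetical class generated by the Avro compiler from your schema, an existing SparkContext sc, and a placeholder input path:

    import org.apache.avro.mapred.{AvroInputFormat, AvroJob, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/records.avro")
    AvroJob.setInputSchema(jobConf, MyRecord.getClassSchema)

    // Each record comes back typed as MyRecord instead of GenericRecord.
    val records = sc.hadoopRDD(
      jobConf,
      classOf[AvroInputFormat[MyRecord]],
      classOf[AvroWrapper[MyRecord]],
      classOf[NullWritable]
    ).map { case (wrapper, _) => wrapper.datum() }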

Re: How true is this about spark streaming?

2014-07-29 Thread Tobias Pfeiffer
Hi, that quoted statement doesn't make too much sense to me, either. Maybe if you had a link for us that shows the context (Google doesn't reveal anything but this conversation), we could evaluate that statement better. Tobias On Tue, Jul 29, 2014 at 5:53 PM, Sean Owen so...@cloudera.com

How to submit Pyspark job in mesos?

2014-07-29 Thread daijia
sched.cpp:217] New master detected at master@192.168.3.91:5050 I0729 18:40:50.442570 15035 sched.cpp:225] No credentials provided. Attempting to register without authentication I0729 18:40:50.443234 15036 sched.cpp:391] Framework registered with 20140729-174911-1526966464-5050-13758-0006 14/07/29 18:40:50

How do you debug a PythonException?

2014-07-29 Thread Nick Chammas
I’m in the PySpark shell and I’m trying to do this: a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json', minPartitions=sc.defaultParallelism * 3).cache() a.map(lambda x: len(x)).max() My job dies with the following: 14/07/30 01:46:28 WARN TaskSetManager: Loss was

How to specify the job to run on the specific nodes(machines) in the hadoop yarn cluster?

2014-07-29 Thread adu
Hi all, as the subject says: I want to run a job on two specific nodes in the cluster. How do I configure YARN for that? Does a YARN queue help? Thanks

Is it possible to read file head in each partition?

2014-07-29 Thread Fengyun RAO
Hi, all. We are migrating from MapReduce to Spark and have encountered a problem. Our input files are IIS logs with a file header. It's easy to get the header if we process only one file, e.g. val lines = sc.textFile("hdfs://*/u_ex14073011.log") val head = lines.take(4) Then we can write our map

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Nicholas Chammas
This is an interesting question. I’m curious to know as well how this problem can be approached. Is there a way, perhaps, to ensure that each input file matching the glob expression gets mapped to exactly one partition? Then you could probably get what you want using RDD.mapPartitions(). Nick
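
One hedged way to get this effect, building on the take(4) snippet above: read each file as its own RDD, grab its header on the driver, broadcast it, and strip/use it inside mapPartitions. The paths, the 4-line header, and parse() are illustrative assumptions, and sc is an existing SparkContext.

    val files = Seq("hdfs:///logs/u_ex14073011.log", "hdfs:///logs/u_ex14073012.log")

    val parsed = files.map { path =>
      val lines = sc.textFile(path)
      val header = sc.broadcast(lines.take(4))  // this file's header, shipped to tasks
      lines.mapPartitions { iter =>
        val head = header.value
        iter.filterNot(head.contains)           // drop header lines from the data
            .map(line => parse(head, line))     // parse() is a hypothetical user function
      }
    }.reduce(_ union _)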

Re: Using Spark Streaming with Kafka 0.7.2

2014-07-29 Thread Andre Schumacher
Hi, for testing you could also just use the Kafka 0.7.2 console consumer and pipe its output to netcat (nc), then process that as in the example https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala That worked for me.
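
A minimal sketch of that workaround, assuming the console consumer's output is piped to netcat listening on localhost:9999 (host, port, and topic are assumptions); the Spark side mirrors the NetworkWordCount example linked above:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Shell side (assumption): kafka-console-consumer.sh --topic mytopic ... | nc -lk 9999
    val ssc = new StreamingContext("local[2]", "KafkaViaNetcat", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()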

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
Could you share more details about the dataset and the algorithm? For example, if the dataset has 10M+ features, it may be slow for the driver to collect the weights from executors (just a blind guess). -Xiangrui On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim unname...@gmail.com wrote: Hi, all

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Fengyun RAO
It will certainly cause bad performance, since it reads the whole content of a large file into one value, instead of splitting it into partitions. Typically one file is 1 GB. Suppose we have 3 large files, in this way, there would only be 3 key-value pairs, and thus 3 tasks at most. 2014-07-30

Re: How to submit Pyspark job in mesos?

2014-07-29 Thread daijia
Actually, it runs okay on my slaves when deployed in standalone mode; the error only occurs when I switch to Mesos. Anyway, thanks for your reply, and any ideas will help.

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Denis RP
I build it with sbt package, I run it with sbt run, and I do use SparkConf.set for deployment options and external jars. It seems that spark-submit can't load extra jars, which leads to a NoClassDefFoundError. Should I pack all the jars into one giant jar and give it a try? I run it on a cluster of 8

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
The weight vector is usually dense, and if you have many partitions, the driver may slow down. You can also take a look at the driver memory in the Executors tab of the web UI. Another setting to check is the HDFS block size and whether the input data is evenly distributed to the executors. Are the

Re: zip two RDD in pyspark

2014-07-29 Thread Davies Liu
On Mon, Jul 28, 2014 at 12:58 PM, l lishu...@gmail.com wrote: I have a file in S3 that I want to map each line with an index. Here is my code: input_data = sc.textFile('s3n://myinput', minPartitions=6).cache() N = input_data.count() index = sc.parallelize(range(N), 6)
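
For comparison, the Scala API offers RDD.zipWithIndex, which pairs each element with its index without building a separate range RDD (it runs one extra job to compute partition offsets); the path matches the example above and sc is an existing SparkContext:

    val input = sc.textFile("s3n://myinput", 6)
    val indexed = input.zipWithIndex()  // RDD[(String, Long)], index in line order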

Re: How to submit Pyspark job in mesos?

2014-07-29 Thread Davies Liu
Maybe Mesos or Spark was not configured correctly; could you check the log files on the Mesos slaves? They should log the reason when Mesos cannot launch the executor. On Tue, Jul 29, 2014 at 10:39 PM, daijia jia_...@intsig.com wrote: Actually, it runs okay in my slaves deployed by standalone mode.