Re: pyspark and hdfs file name

2014-11-14 Thread Oleg Ruchovets
Hi Davies. Thank you for the quick answer. I have code like this: sc = SparkContext(appName="TAD"); lines = sc.textFile(sys.argv[1], 1); result = lines.map(doSplit).groupByKey().map(lambda (k,vc): traffic_process_model(k,vc)); result.saveAsTextFile(sys.argv[2]). Can you please give short

Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Deep Pradhan
val iter = toLocalIterator (rdd) This is what I am doing and it says error: not found On Fri, Nov 14, 2014 at 12:34 PM, Patrick Wendell pwend...@gmail.com wrote: It looks like you are trying to directly import the toLocalIterator function. You can't import functions, it should just appear as

Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Andrew Ash
Deep, toLocalIterator is a method on the RDD class. So try this instead: rdd.toLocalIterator() On Fri, Nov 14, 2014 at 12:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: val iter = toLocalIterator (rdd) This is what I am doing and it says error: not found On Fri, Nov 14, 2014 at
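
For reference, a minimal self-contained sketch of the call pattern Andrew describes; the app name, master URL, and data are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("toLocalIterator-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, 4)

    // toLocalIterator is a method on RDD, not a standalone function: it streams
    // the RDD back to the driver one partition at a time.
    val iter: Iterator[Int] = rdd.toLocalIterator
    iter.foreach(println)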

same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-14 Thread aaronlin
Hi folks, Although SPARK-1977 says this problem was resolved in 1.0.2, I still hit it while running the script on AWS EC2 via spark-ec2.py. I checked SPARK-1977 and found that Twitter chill resolved the problem in v0.4.0, not v0.3.6, but Spark depends on chill v0.3.6

Spark Memory Hungry?

2014-11-14 Thread TJ Klein
Hi, I am using PySpark (1.1) for some image processing tasks. The images (stored as an RDD) are on the order of several MB up to low/mid two-digit MB each. However, when running operations on this data with Spark, memory usage blows up. Is there anything I can do about it? I

Re: Spark Memory Hungry?

2014-11-14 Thread Andrew Ash
TJ, what was your expansion factor between image size on disk and in memory in pyspark? I'd expect in memory to be larger due to Java object overhead, but don't know the exact amounts you should expect. On Fri, Nov 14, 2014 at 12:50 AM, TJ Klein tjkl...@gmail.com wrote: Hi, I am using

Re: EmptyRDD

2014-11-14 Thread Ted Yu
See http://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/EmptyRDD.html On Nov 14, 2014, at 2:09 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: How to create an empty RDD in Spark? Thank You

Re: EmptyRDD

2014-11-14 Thread Gerard Maas
If I remember correctly, EmptyRDD is private[spark]. You can create an empty RDD using the spark context: val emptyRdd = sc.emptyRDD -kr, Gerard. On Fri, Nov 14, 2014 at 11:22 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: To get an empty RDD, I did this: I have an rdd with one

Spark streaming fault tolerance question

2014-11-14 Thread François Garillot
Hi guys, I have a question about how the basics of D-Streams, accumulators, failure and speculative execution interact. Let's say I have a streaming app that takes a stream of strings, formats them (let's say it converts each to Unicode), and prints them (e.g. on a news ticker). I know print()

Re: EmptyRDD

2014-11-14 Thread Gerard Maas
It looks like a Scala issue. Seems like the implicit conversion to ArrayOps does not apply if the type is Array[Nothing]. Try giving a type to the empty RDD: val emptyRdd: RDD[Any] = sc.emptyRDD emptyRdd.collect.foreach(println) // prints a line return -kr, Gerard. On Fri, Nov 14, 2014 at
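
A short sketch of the typed form Gerard suggests, assuming an existing SparkContext named sc:

    import org.apache.spark.rdd.RDD

    // sc.emptyRDD infers RDD[Nothing] by default; an explicit element type keeps
    // the usual operations (collect, foreach, ...) well-typed.
    val emptyInts: RDD[Int] = sc.emptyRDD[Int]
    emptyInts.collect().foreach(println)   // prints nothing
    println(emptyInts.count())             // 0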

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Which version are you using? You probably hit this bug https://issues.apache.org/jira/browse/SPARK-3421 if some field name in the JSON contains characters other than [a-zA-Z0-9_]. This has been fixed in https://github.com/apache/spark/pull/2563 On 11/14/14 6:35 PM, vdiwakar.malladi wrote:

Re: saveAsParquetFile throwing exception

2014-11-14 Thread vdiwakar.malladi
Thanks for your response. I'm using Spark 1.1.0. Currently I have the Spark setup that comes with Cloudera CDH (via Cloudera Manager). Could you please suggest how I can make use of the patch? Thanks in advance.

Re: 1gb file processing...task doesn't launch on all the nodes...Unseen exception

2014-11-14 Thread Akhil Das
It shows a NullPointerException; your data could be corrupted. Try putting a try/catch inside the operation that you are doing. Are you running the worker process on the master node also? If not, then only one node will be doing the processing. If yes, then try setting the level of parallelism and

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Hm, I'm not sure whether this is the official way to upgrade CDH Spark. Maybe you can check out https://github.com/cloudera/spark, apply the required patches, and then compile your own version. On 11/14/14 8:46 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using Spark 1.1.0. Currently

Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
Hi, I am trying to read an HDFS file from Spark scheduler code. I could find how to do HDFS reads/writes in Java, but I need to access HDFS from Spark using Scala. Can someone please help me in this regard?

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
Like this? val file = sc.textFile("hdfs://localhost:9000/sigmoid/input.txt") Thanks Best Regards On Fri, Nov 14, 2014 at 9:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I am trying to read an HDFS file from Spark scheduler code. I could find how to do HDFS reads/writes in Java.

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
Can you not create a SparkContext inside the scheduler code? If you are looking just to access HDFS, then you can use the following object; with it, you can create/read/write files: val hdfs = org.apache.hadoop.fs.FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConf) Thanks Best Regards On
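
A self-contained sketch along the lines Akhil suggests; the namenode URI and file path are placeholders (note the java.net.URI import, which is easy to miss):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // Placeholder namenode URI; adjust to your cluster.
    val hadoopConf = new Configuration()
    val hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConf)

    // Read a text file line by line through the HDFS client API.
    val in = hdfs.open(new Path("/sigmoid/input.txt"))
    try {
      Source.fromInputStream(in).getLines().foreach(println)
    } finally {
      in.close()
    }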

User Authn and Authz in Spark missing ?

2014-11-14 Thread Zeeshan Ali Shah
Hi, I am facing an issue as a cloud sysadmin: when the Spark master is launched on a public IP, anyone who knows its URL can submit jobs to it. Is there any way/hack to get authentication (authn) and authorization (authz) in Spark? I tried to look into it but could not find anything. Any hint? -- Regards Zeeshan Ali Shah

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
[image: Inline image 1] Thanks Best Regards On Fri, Nov 14, 2014 at 9:18 PM, Bui, Tri tri@verizonwireless.com.invalid wrote: It should be val file = sc.textFile("hdfs:///localhost:9000/sigmoid/input.txt") with 3 “///” Thanks Tri *From:* rapelly kartheek

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
I'll just try out the object Akhil provided. There was no problem working in the shell with sc.textFile. Thank you Akhil and Tri. On Fri, Nov 14, 2014 at 9:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote: [image: Inline image 1] Thanks Best Regards On Fri, Nov 14, 2014 at 9:18 PM, Bui,

Set worker log configuration when running local[n]

2014-11-14 Thread Jim Carroll
How do I set the log level when running local[n]? It ignores the log4j.properties file on my classpath. I also tried to set the Spark home dir on the SparkConf using setSparkHome and made sure an appropriate log4j.properties file was in a conf subdirectory, and that didn't work either. I'm

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
Hi Akhil, I get the error: not found: value URI On Fri, Nov 14, 2014 at 9:29 PM, rapelly kartheek kartheek.m...@gmail.com wrote: I'll just try out the object Akhil provided. There was no problem working in the shell with sc.textFile. Thank you Akhil and Tri. On Fri, Nov 14, 2014 at 9:21 PM,

Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Simone Franzini
Let's say I have to apply a complex sequence of operations to a certain RDD. In order to make code more modular/readable, I would typically have something like this: object myObject { def main(args: Array[String]) { val rdd1 = function1(myRdd) val rdd2 = function2(rdd1) val rdd3 =

Re: Set worker log configuration when running local[n]

2014-11-14 Thread Jim Carroll
Actually, it looks like it's Parquet logging that I don't have control over. For some reason the Parquet project decided to use java.util.logging with its own logging configuration.

Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-14 Thread Xiangrui Meng
If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102 The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this
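
A sketch of the registrator approach used in the linked MovieLensALS example; the class and setting names below follow that example but should be adapted to your own job:

    import scala.collection.mutable
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

    // Register the classes ALS shuffles so Kryo knows how to serialize them.
    class ALSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[Rating])
        kryo.register(classOf[mutable.BitSet])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", classOf[KryoSerializer].getName)
      .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)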

Re: flatMap followed by mapPartitions

2014-11-14 Thread Debasish Das
mapPartitions tried to hold the data in memory, which did not work for me... I am doing flatMap followed by groupByKey now, with a HashPartitioner and 60 blocks (based on the 120 cores I am running the job on)... Now when the shuffle size is 100 GB it works fine... as flatMap shuffle goes to 200

How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
I'm running a local spark master (local[n]). I cannot seem to turn off the parquet logging. I tried: 1) Setting a log4j.properties on the classpath. 2) Setting a log4j.properties file in a spark install conf directory and pointing to the install using setSparkHome 3) Editing the

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
How about using the fluent style of Scala programming? On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com wrote: Let's say I have to apply a complex sequence of operations to a certain RDD. In order to make code more modular/readable, I would typically have something like

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Sean Owen
This code executes on the driver, and an RDD here is really just a handle on all the distributed data out there. It's a local bookkeeping object. So, manipulation of these objects themselves in the local driver code has virtually no performance impact. These two versions would be about identical*.
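
For illustration, a runnable sketch of the two shapes being compared; the helper functions are stand-ins for function1/function2/function3 from the original question, and sc is assumed to exist:

    import org.apache.spark.rdd.RDD

    // Illustrative transformations standing in for function1/function2/function3.
    def addOne(rdd: RDD[Int]): RDD[Int]  = rdd.map(_ + 1)
    def evens(rdd: RDD[Int]): RDD[Int]   = rdd.filter(_ % 2 == 0)
    def doubled(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * 2)

    val myRdd = sc.parallelize(1 to 100)

    // Version 1: named intermediate handles.
    val rdd1 = addOne(myRdd)
    val rdd2 = evens(rdd1)
    val rdd3 = doubled(rdd2)

    // Version 2: the same lineage written fluently.
    val fluent = doubled(evens(addOne(myRdd)))

    // Both are local bookkeeping objects describing the same DAG; nothing runs
    // until an action such as count() is called.
    rdd3.count()
    fluent.count()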

Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread YaoPau
I have an RDD x of millions of STRINGs, each of which I want to pass through a set of filters. My filtering code looks like this: x.filter(filter#1, which will filter out 40% of data). filter(filter#2, which will filter out 20% of data). filter(filter#3, which will filter out 2% of data).

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
This is a problem because of a bug in Spark's current master (beyond the fact that Parquet uses java.util.logging). ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else
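
Until that is fixed, one possible driver-side workaround is to silence Parquet's java.util.logging loggers before any Parquet file is touched; this is only a sketch, assumes Parquet's loggers live under the "parquet" namespace, and may not survive Spark re-initialising the parquet logger as described above:

    import java.util.logging.{Level, Logger}

    // Silence Parquet's JUL output: drop all records, detach existing handlers,
    // and stop records from bubbling up to the root console handler.
    val parquetLogger = Logger.getLogger("parquet")
    parquetLogger.setLevel(Level.OFF)
    parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
    parquetLogger.setUseParentHandlers(false)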

Re: Set worker log configuration when running local[n]

2014-11-14 Thread Jim Carroll
Just to be complete, this is a problem in Spark that I worked around and detailed here: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-td18955.html

saveAsTextFile error

2014-11-14 Thread Niko Gamulin
Hi, I tried to modify the NetworkWordCount example in order to save the output to a file. In NetworkWordCount.scala I replaced the line wordCounts.print() with wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt") When I ran sbt/sbt package it returned the following error: [error]

Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi, I was wondering if the adaptive stream processing and dynamic batch processing was available to use in spark streaming? If someone could help point me in the right direction? Thanks, Josh

Re: No module named pyspark - latest built

2014-11-14 Thread Andrew Or
I see. The general known constraints on building your assembly jar for pyspark on YARN are: Java 6; NOT RedHat; Maven. Some of these are documented here: http://spark.apache.org/docs/latest/building-with-maven.html (bottom). Maybe we should make it more explicit. 2014-11-13 2:31 GMT-08:00 jamborta

Re: saveAsTextFile error

2014-11-14 Thread Harold Nguyen
Hi Niko, It looks like you are calling a method that does not exist on DStream. Check out https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams for the method saveAsTextFiles. Harold On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin
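
A hedged sketch of the corrected NetworkWordCount flow; the host, port, and output path are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // DStream has saveAsTextFiles (plural): it writes one output directory per
    // batch, named <prefix>-<batch time in ms>[.<suffix>].
    wordCounts.saveAsTextFiles("/home/bart/rest_services/output", "txt")

    ssc.start()
    ssc.awaitTermination()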

Re: Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread Aaron Davidson
In the situation you show, Spark will pipeline the filters together and apply each filter one at a time to each row, effectively constructing a single combined condition. You would only see a performance difference if the filter code itself is somewhat expensive; then you would want to only execute it on
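
A small sketch of that pipelining point; the modulo predicates are placeholders for filter#1–#3, and sc is assumed to exist:

    val x = sc.parallelize(1 to 1000000)

    // Consecutive filters are pipelined, so this...
    val chained = x
      .filter(_ % 2 == 0)   // stand-in for filter#1
      .filter(_ % 3 == 0)   // stand-in for filter#2
      .filter(_ % 7 == 0)   // stand-in for filter#3

    // ...is evaluated row by row much like one combined predicate:
    val combined = x.filter(v => v % 2 == 0 && v % 3 == 0 && v % 7 == 0)

    // Ordering only matters when the predicates themselves are expensive; then
    // putting the cheapest or most selective filter first saves work per row.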

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Michael Armbrust
Anyone want a PR? Yes please.

Compiling Spark master HEAD failed.

2014-11-14 Thread Jianshi Huang
The mvn build command is mvn clean install -Pyarn -Phive -Phive-0.13.1 -Phadoop-2.4 -Djava.version=1.7 -DskipTests I'm getting this error message: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-hive_2.10: wrap:

Cancelled Key Exceptions on Massive Join

2014-11-14 Thread Ganelin, Ilya
Hello all. I have been running a Spark job that eventually needs to do a large join: 24 million x 150 million. A broadcast join is clearly infeasible in this instance, so I am instead attempting to do it with hash partitioning by defining a custom partitioner as: class
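
Since the class definition is cut off above, here is a generic sketch of what such a custom hash partitioner typically looks like; the class name and partition count are assumptions, not the poster's code:

    import org.apache.spark.Partitioner

    // Spread keys over a fixed number of partitions by their hash code.
    class CustomHashPartitioner(val numParts: Int) extends Partitioner {
      override def numPartitions: Int = numParts

      override def getPartition(key: Any): Int = {
        val h = key.hashCode() % numParts
        if (h < 0) h + numParts else h   // hashCode can be negative
      }

      override def equals(other: Any): Boolean = other match {
        case p: CustomHashPartitioner => p.numParts == numParts
        case _ => false
      }
    }

    // Partition both sides identically before the join so matching keys co-locate:
    // left.partitionBy(new CustomHashPartitioner(1024))
    //   .join(right.partitionBy(new CustomHashPartitioner(1024)))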

How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
I have instrumented word count to track how many machines the code runs on. I use an accumulator to maintain a Set of MAC addresses. I find that everything is done on a single machine. This is probably optimal for word count but not for the larger problems I am working on. How do I force processing to

Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Egor Pahomov
Hi. I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. Task never ends. Code: import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 +

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Daniel Siegmann
Most of the information you're asking for can be found on the Spark web UI (see here http://spark.apache.org/docs/1.1.0/monitoring.html). You can see which tasks are being processed by which nodes. If you're using HDFS and your file size is smaller than the HDFS block size you will only have one
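
For reference, a hedged sketch of the usual knobs for raising the number of partitions (and hence tasks) in this situation; the path and the number 64 are illustrative, and sc is assumed to exist:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)

    // Ask for more input splits up front via the minPartitions hint...
    val lines = sc.textFile("hdfs:///data/words.txt", 64)

    // ...or explicitly reshuffle into more partitions afterwards.
    val spread = lines.repartition(64)

    val counts = spread
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 64)   // number of reduce-side partitions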

Submitting Python Applications from Remote to Master

2014-11-14 Thread Benjamin Zaitlen
Hi All, I'm not quite clear on whether submitting a python application to spark standalone on ec2 is possible. Am I reading this correctly: *A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Egor Pahomov
It's successful without dynamic allocation. I can provide the Spark log for that scenario if it helps. 2014-11-14 21:36 GMT+02:00 Sandy Ryza sandy.r...@cloudera.com: Hi Egor, Is it successful without dynamic allocation? From your log, it looks like the job is unable to acquire resources from

Kryo serialization in examples.streaming.TwitterAlgebirdCMS/HLL

2014-11-14 Thread Debasish Das
Hi, If I look inside the Algebird Monoid implementation, it uses java.io.Serializable... But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't see a KryoRegistrator for the CMS and HLL monoids... In these examples, will we run with Kryo serialization on CMS and HLL, or will they be Java

Re: Using data in RDD to specify HDFS directory to write to

2014-11-14 Thread jschindler
I reworked my app using your idea of throwing the data in a map. It looks like it should work but I'm getting some strange errors and my job gets terminated. I get a WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered

Multiple Spark Context

2014-11-14 Thread Charles
I need to continuously run multiple calculations concurrently on a cluster. They are not sharing RDDs. Each of the calculations needs a different number of cores and a different amount of memory. Also, some of them are long-running calculations and others are short-running calculations. They all need to be run on a regular basis

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
Jira: https://issues.apache.org/jira/browse/SPARK-4412 PR: https://github.com/apache/spark/pull/3271

SparkSQL exception on cached parquet table

2014-11-14 Thread Sadhan Sood
While testing SparkSQL on a bunch of parquet files (which basically used to be a partition of one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path;

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Sadhan Sood
Thanks Cheng, that was helpful. I noticed from the UI that only half of the memory per executor was being used for caching; is that true? We have a 2 TB sequence-file dataset that we wanted to cache in our cluster with ~5 TB of memory, but caching still failed, and what it looked like from the UI was that it

Re: Multiple Spark Context

2014-11-14 Thread Daniil Osipov
It's not recommended to have multiple Spark contexts in one JVM, but you could launch a separate JVM per context. How resources get allocated is probably outside the scope of Spark, and more of a task for the cluster manager. On Fri, Nov 14, 2014 at 12:58 PM, Charles charles...@cenx.com wrote: I

Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
We have a bunch of data in RedShift tables that we'd like to pull in during job runs to Spark. What is the path/url format one uses to pull data from there? (This is in reference to using the https://github.com/mengxr/redshift-input-format)

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Sandy Ryza
That would be helpful as well. Can you confirm that when you try it with dynamic allocation the cluster has free resources? On Fri, Nov 14, 2014 at 12:17 PM, Egor Pahomov pahomov.e...@gmail.com wrote: It's successful without dynamic allocation. I can provide the Spark log for that scenario if it

RE: Multiple Spark Context

2014-11-14 Thread Bui, Tri
Does this also apply to StreamingContext? What issues would I have if I had 1000s of StreamingContexts? Thanks Tri From: Daniil Osipov [mailto:daniil.osi...@shazam.com] Sent: Friday, November 14, 2014 3:47 PM To: Charles Cc: u...@spark.incubator.apache.org Subject: Re: Multiple Spark Context

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
The cluster runs Mesos and I can see the tasks in the Mesos UI, but most are not doing much. Any hints about that UI? On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: Most of the information you're asking for can be found on the Spark web UI (see here

RE: Multiple Spark Context

2014-11-14 Thread Charles
Thanks for your reply! Can you be more specific about the JVM? Is the JVM referring to the driver application? If I want to create multiple SparkContexts, will I need to start a driver application instance for each SparkContext?

Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop

2014-11-14 Thread xiaoyan yu
I had the same need as the one documented back in July, archived at http://qnalist.com/questions/5013193/client-application-that-calls-spark-and-receives-an-mllib-model-scala-object-not-just-result. I wonder if anyone would like to share any success stories. Thanks, Xiaoyan

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Andrew Or
Hey Egor, Have you checked the AM logs? My guess is that it threw an exception or something such that no executors (not even the initial set) have registered with your driver. You may already know this, but you can go to the ResourceManager page at http://&lt;RM address&gt;:8088 and click into the application to access

Re: Sourcing data from RedShift

2014-11-14 Thread Michael Armbrust
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong though. On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com wrote: We have a bunch of data in RedShift tables that we'd like to pull in
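
If that guess is right, a minimal sketch of reading such UNLOAD output follows; the credentials, bucket, path, and delimiter are placeholders, and sc is assumed to exist:

    // Placeholder AWS credentials and bucket; the path points at files written
    // by Redshift's UNLOAD command.
    val unloaded = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/unload/prefix/part-*")

    // UNLOAD emits delimited text (pipe-delimited by default), so split each line.
    val rows = unloaded.map(_.split('|'))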

Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper http://dl.acm.org/citation.cfm?id=2670995. On Fri, Nov 14, 2014 at 10:42 AM, Josh J joshjd...@gmail.com wrote: Hi, I was wondering if the adaptive stream processing and dynamic batch processing was available to use in spark streaming? If someone could help point me

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
Hmm, we actually read the CSV data from S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100GB of text data for a job directly from S3; our hope had been that connecting directly to Redshift would provide some boost. We had been using 12 m3.xlarges,

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
I'll try this out and follow up with what I find. On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com wrote: For each node, if the CSV reader is implemented efficiently, you should be able to hit at least half of the theoretical network bandwidth, which is about

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
Hm… Have you tuned spark.storage.memoryFraction? By default, 60% of memory is used for caching. You may refer to details here: http://spark.apache.org/docs/latest/configuration.html On 11/15/14 5:43 AM, Sadhan Sood wrote: Thanks Cheng, that was helpful. I noticed from the UI that only half
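
For example, a conf-level sketch of raising that fraction; the 0.8 value is illustrative and leaves less headroom for execution memory:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cache-tuning")
      // Default is 0.6; a higher value gives the block manager more room for
      // cached RDD blocks at the expense of execution memory.
      .set("spark.storage.memoryFraction", "0.8")

    val sc = new SparkContext(conf)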

filtering a SchemaRDD

2014-11-14 Thread Daniel, Ronald (ELS-SDG)
Hi all, I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one of which holds the text for a sentence from a document. # Load sentence data table sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing') sentenceRDD.take(3) Out[20]: [Row(annotID=118,

Re: filtering a SchemaRDD

2014-11-14 Thread Vikas Agarwal
Hi, did you try using single quotes instead of double quotes around the column name? I faced a similar situation with Apache Phoenix. On Saturday, November 15, 2014, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Hi all, I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields,

Re: filtering a SchemaRDD

2014-11-14 Thread Michael Armbrust
If I use row[6] instead of row["text"] I get what I am looking for. However, finding the right numeric index could be a pain. Can I access the fields in a Row of a SchemaRDD by name, so that I can map, filter, etc. without a trial-and-error process of finding the right int for the
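
As a workaround sketch (written in Scala here, although the thread itself is PySpark), columns can always be selected by name through SQL over a registered temp table; the table name and column names mirror the example above, and the filter condition is illustrative:

    // Register the parquet-backed SchemaRDD and project/filter by column name.
    val sentenceRDD = sqlContext.parquetFile("s3n://some/path/thing")
    sentenceRDD.registerTempTable("sentences")

    val texts = sqlContext.sql("SELECT text FROM sentences WHERE annotID > 100")
    texts.take(3).foreach(println)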