Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-08-03 Thread Umesh Kacha
Hi all, any help will be much appreciated. My spark job runs fine but in the middle it starts losing executors because of a MetadataFetchFailed exception saying shuffle not found at the location, since the executor is lost. On Jul 31, 2015 11:41 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi thanks for the

How to calculate standard deviation of grouped data in a DataFrame?

2015-08-03 Thread the3rdNotch
I have user logs that I have taken from a csv and converted into a DataFrame in order to leverage the SparkSQL querying features. A single user will create numerous entries per hour, and I would like to gather some basic statistical information for each user; really just the count of the user
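
A sketch of one way to get per-user count, mean, and standard deviation in Spark 1.4-era Scala, computing the population stddev as sqrt(E[x^2] - E[x]^2) since there is no built-in stddev aggregate at this point; the DataFrame logs and its "user" and "value" columns are assumptions standing in for the poster's schema:

    import org.apache.spark.sql.functions._

    // logs: DataFrame with a "user" column and a numeric "value" column (assumed)
    val stats = logs.groupBy("user").agg(
      count("value").as("n"),
      avg("value").as("mean"),
      sqrt(avg(col("value") * col("value")) - pow(avg("value"), 2)).as("stddev")
    )
    stats.show()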

large scheduler delay in pyspark

2015-08-03 Thread gen tang
Hi, Recently, I met some problems with scheduler delay in pyspark. I worked on this problem for several days, but without success. Therefore, I have come here to ask for help. I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to merge values by adding two lists; if I do reduceByKey as
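
The thread is PySpark, but the pattern is easiest to show with a small Scala sketch (a SparkContext sc is assumed, as in spark-shell): merging values with list concatenation in reduceByKey copies the growing list on every merge, while aggregateByKey can append into a mutable buffer in place.

    import scala.collection.mutable.ArrayBuffer

    // Toy stand-in for the rdd[(key, list[dict])] above
    val pairs = sc.parallelize(Seq(
      ("u1", Map("event" -> 1)),
      ("u1", Map("event" -> 2)),
      ("u2", Map("event" -> 3))))

    val merged = pairs.aggregateByKey(ArrayBuffer.empty[Map[String, Int]])(
      (buf, m) => buf += m,    // fold one value into the per-key buffer
      (b1, b2) => b1 ++= b2)   // merge buffers from different partitions
    merged.collect().foreach(println)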

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
Hi, there was this issue for Scala 2.11. https://issues.apache.org/jira/browse/SPARK-7944 It should be fixed on master branch. You may be hitting that. Best, Burak On Sun, Aug 2, 2015 at 9:06 PM, Ted Yu yuzhih...@gmail.com wrote: I tried the following command on master branch: bin/spark-shell

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
In addition, you do not need to use --jars with --packages. --packages will get the jar for you. Best, Burak On Mon, Aug 3, 2015 at 9:01 AM, Burak Yavuz brk...@gmail.com wrote: Hi, there was this issue for Scala 2.11. https://issues.apache.org/jira/browse/SPARK-7944 It should be fixed on
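
A minimal invocation along those lines (the exact package coordinates are an assumption; pick the artifact matching your Scala version):

    bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0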

Re: HiveQL to SparkSQL

2015-08-03 Thread Bigdata techguy
Did anybody try to convert HiveQL queries to SparkSQL? If so, would you share the experience, pros and cons please? Thank you. On Thu, Jul 30, 2015 at 10:37 AM, Bigdata techguy bigdatatech...@gmail.com wrote: Thanks Jorn for the response and for the pointer questions to Hive optimization tips.

Re: NullPointException Help while using accumulators

2015-08-03 Thread Ted Yu
Can you show related code in DriverAccumulator.java? Which Spark release do you use? Cheers On Mon, Aug 3, 2015 at 3:13 PM, Anubhav Agarwal anubha...@gmail.com wrote: Hi, I am trying to modify my code to use HDFS and multiple nodes. The code works fine when I run it locally in a single

Re: Contributors group and starter task

2015-08-03 Thread Ted Yu
Once you submit a pull request for some JIRA, the JIRA would be assigned to you. Cheers On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya katariya.na...@gmail.com wrote: My username on the Apache JIRA is katariya.namit. Could one of the admins please add me to the contributors group so that I

Re: SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
I think I just answered my own question. The privatization of the RDD API might have resulted in my error, because this worked: randomMatBr <- SparkR:::broadcast(sc, randomMat) On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel deborah.sie...@gmail.com wrote: Hello, In looking at the SparkR

Safe to write to parquet at the same time?

2015-08-03 Thread Philip Weaver
I think this question applies regardless of whether I have two completely separate Spark jobs or tasks on different machines, or two cores that are part of the same task on the same machine. If two jobs/tasks/cores/stages both save to the same parquet directory in parallel like this:

shutdown local hivecontext?

2015-08-03 Thread Cesar Flores
We are using a local hive context in order to run unit tests. Our unit tests run perfectly fine if we run them one by one using sbt as in the next example: sbt test-only com.company.pipeline.scalers.ScalerSuite.scala sbt test-only com.company.pipeline.labels.ActiveUsersLabelsSuite.scala However, if we

Re: Contributors group and starter task

2015-08-03 Thread Marcelo Vanzin
Hi Namit, There's no need to assign a bug to yourself to say you're working on it. The recommended way is to just post a PR on github - the bot will update the bug saying that you have a patch open to fix the issue. On Mon, Aug 3, 2015 at 3:50 PM, Namit Katariya katariya.na...@gmail.com wrote:

SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
Hello, In looking at the SparkR codebase, it seems as if broadcast variables ought to be working, based on the tests. I have tried the following in the sparkR shell, and similar code in RStudio, but in both cases got the same message: randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))

NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
Hi, I am trying to modify my code to use HDFS and multiple nodes. The code works fine when I run it locally in a single machine with a single worker. I have been trying to modify it and I get the following error. Any hint would be helpful. java.lang.NullPointerException at

How does DataFrame except work?

2015-08-03 Thread Srikanth
Hello, I'm planning to use DF1.except(DF2) to get the difference between two dataframes. I'd like to know how exactly this API works. Both explain() and the Spark UI show except as an operation on its own. Internally, does it do a hash partition of both dataframes? If so will it do auto broadcast if
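
A quick way to inspect this yourself (df1 and df2 are assumed DataFrames with the same schema): explain(true) prints the logical and physical plans, including any Exchange (shuffle) operators.

    val diff = df1.except(df2)
    diff.explain(true)
    diff.show()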

Re: shutdown local hivecontext?

2015-08-03 Thread Michael Armbrust
TestHive takes care of creating a temporary directory for each invocation so that multiple test runs won't conflict. On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores ces...@gmail.com wrote: We are using a local hive context in order to run unit tests. Our unit tests runs perfectly fine if we run

Multiple UpdateStateByKey Functions in the same job?

2015-08-03 Thread swetha
Hi, Can I use multiple updateStateByKey functions in the same Streaming job? Suppose I need to maintain the state of the user session in the form of a JSON, plus counts of various other metrics which have different keys. Can I use multiple updateStateByKey functions to maintain the state for different
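
A sketch of the idea (the keyed DStreams sessionEvents and metricEvents, and a checkpoint-enabled StreamingContext, are assumptions): nothing stops a job from carrying several independent pieces of state, each with its own update function.

    val updateSession = (events: Seq[String], state: Option[String]) =>
      Some((state.toSeq ++ events).mkString(";"))       // toy session state
    val updateCount = (counts: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + counts.sum)

    val sessions = sessionEvents.updateStateByKey(updateSession)  // DStream[(String, String)]
    val metrics  = metricEvents.updateStateByKey(updateCount)     // DStream[(String, Long)]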

Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Upen N
Hi, I recently installed Cloudera CDH 5.4.4. Spark comes shipped with this version. I created Spark gateways, but I get the following error when I run the Spark shell from the gateway. Does anyone have any similar experience? If so, please share the solution. Google suggests copying the conf files from

Re: Writing to HDFS

2015-08-03 Thread ayan guha
Is your data skewed? What happens if you do rdd.count()? On 4 Aug 2015 05:49, Jasleen Kaur jasleenkaur1...@gmail.com wrote: I am executing a spark job on a cluster as a yarn-client(Yarn cluster not an option due to permission issues). - num-executors 800 - spark.akka.frameSize=1024
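
A small sketch for the skew check (assuming a pair RDD named rdd): count records per key and look at the heaviest keys.

    val hottest = rdd.mapValues(_ => 1L)
                     .reduceByKey(_ + _)
                     .top(10)(Ordering.by(_._2))   // the ten heaviest keys
    hottest.foreach(println)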

Unable to compete with performance of single-threaded Scala application

2015-08-03 Thread Philip Weaver
Hello, I am running Spark 1.4.0 on Mesos 0.22.1, and usually I run my jobs in coarse-grained mode. I have written some single-threaded standalone Scala applications for a problem that I am working on, and I am unable to get a Spark solution that comes close to the performance of this

Re: NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
The code was written in 1.4 but I am compiling it and running it with 1.3. import it.unimi.dsi.fastutil.objects.Object2ObjectOpenHashMap; import org.apache.spark.AccumulableParam; import scala.Tuple4; import thomsonreuters.trailblazer.operation.DriverCalc; import

Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Marcelo Vanzin
That should not be a fatal error, it's just a noisy exception. Anyway, it should go away if you add YARN gateways to those nodes (aside from Spark gateways). On Mon, Aug 3, 2015 at 7:10 PM, Upen N ukn...@gmail.com wrote: Hi, I recently installed Cloudera CDH 5.4.4. Sparks comes shipped with

Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Guru Medasani
Hi Upen, Did you deploy the client configs after assigning the gateway roles? You should be able to do this from Cloudera Manager. Can you try this and let us know what you see when you run spark-shell? Guru Medasani gdm...@gmail.com On Aug 3, 2015, at 9:10 PM, Upen N ukn...@gmail.com

Re: NullPointException Help while using accumulators

2015-08-03 Thread Ted Yu
Putting your code in a file, I find the following on line 17: stepAcc = new StepAccumulator(); However, I don't think that was where the NPE was thrown. Another thing I don't understand is that there were two addAccumulator() calls at the top of the stack trace, while in your code I
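
For reference, a minimal AccumulableParam sketch against the Spark 1.3-era API (this is not the poster's DriverAccumulator, just an illustration): a zero() that returns a value containing nulls is a classic source of NPEs inside addAccumulator.

    import org.apache.spark.AccumulableParam
    import scala.collection.mutable.ArrayBuffer

    class BufferParam extends AccumulableParam[ArrayBuffer[String], String] {
      // zero() must hand back a fully initialized value
      def zero(initial: ArrayBuffer[String]): ArrayBuffer[String] = ArrayBuffer.empty[String]
      def addAccumulator(buf: ArrayBuffer[String], s: String): ArrayBuffer[String] = buf += s
      def addInPlace(b1: ArrayBuffer[String], b2: ArrayBuffer[String]): ArrayBuffer[String] = b1 ++= b2
    }

    // usage: val acc = sc.accumulable(ArrayBuffer.empty[String])(new BufferParam)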

Contributors group and starter task

2015-08-03 Thread Namit Katariya
My username on the Apache JIRA is katariya.namit. Could one of the admins please add me to the contributors group so that I can have a starter task assigned to myself? Thanks, Namit

Re: Spark-Submit error

2015-08-03 Thread satish chandra j
Hi Guru, I am executing this on a DataStax Enterprise Spark node and the ~/.dserc file exists, which contains the Cassandra credentials, but I am still getting the error. Below is the given command: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars

Re: Spark-Submit error

2015-08-03 Thread Guru Medasani
Hi Satish, Can you add more error or log info to the email? Guru Medasani gdm...@gmail.com On Jul 31, 2015, at 1:06 AM, satish chandra j jsatishchan...@gmail.com wrote: HI, I have submitted a Spark Job with options jars,class,master as local but i am getting an error as below dse

Re: Data from PostgreSQL to Spark

2015-08-03 Thread Jeetendra Gangele
Here is the solution this looks perfect for me. thanks for all your help http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/ On 28 July 2015 at 23:27, Jörn Franke jornfra...@gmail.com wrote: Can you put some transparent cache in front of the database? Or

Re: Spark-Submit error

2015-08-03 Thread Guru Medasani
Thanks Satish. I only see the INFO messages and don’t see any error messages in the output you pasted. Can you paste the log with the error messages? Guru Medasani gdm...@gmail.com On Aug 3, 2015, at 11:12 PM, satish chandra j jsatishchan...@gmail.com wrote: Hi Guru, I am executing

Repartition question

2015-08-03 Thread Naveen Madhire
Hi All, I am running the Wikipedia parsing example present in the Advanced Analytics with Spark book. https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112 The partitions of the RDD returned by

Re: Unable to query existing hive table from spark sql 1.3.0

2015-08-03 Thread Ishwardeep Singh
Which database is your table in, default or result? By default Spark will try to look for the table in the default database. If the table exists in the result database, try to prefix the table name with the database name, like select * from result.salarytest, or set the database by executing use database name
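
For example (assuming a HiveContext named sqlContext, with the database and table names from the thread):

    sqlContext.sql("use result")
    val df = sqlContext.sql("select * from result.salarytest")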

spark streaming max receiver rate doubts

2015-08-03 Thread Shushant Arora
1. In Spark 1.3 (non-receiver) - if my batch interval is 1 sec and I don't set spark.streaming.kafka.maxRatePerPartition, is the default behaviour to bring all messages from Kafka from the last offset to the current offset? Say the number of messages was large and it took 5 sec to process them, so will all jobs
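
For reference, the cap mentioned above is set like any other conf entry (the rate of 1000 is an assumed example); with it, each Kafka partition contributes at most that many messages per second to a batch instead of everything since the last consumed offset:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-stream")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")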

spark --files permission error

2015-08-03 Thread Shushant Arora
Is there any setting to allow --files to copy jars from the driver to executor nodes? When I am passing some jar files using --files to executors and adding them to the class path of the executor, it throws a File not found exception: 15/08/03 07:59:50 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8,

Re: spark cluster setup

2015-08-03 Thread Akhil Das
Are you sitting behind a firewall and accessing a remote master machine? In that case, have a look at this http://spark.apache.org/docs/latest/configuration.html#networking, you might want to fix a few properties like spark.driver.host, spark.driver.port, etc. Thanks Best Regards On Mon, Aug 3,

Re: Checkpoint file not found

2015-08-03 Thread Anand Nalya
Hi, It's an application that maintains some state from the DStream using the updateStateByKey() operation. It then selects some of the records from the current batch using some criteria over the current values and the state, and carries over the remaining values to the next batch. Following is the pseudo code:

Re: About memory leak in spark 1.4.1

2015-08-03 Thread Barak Gitsis
Sea, it exists, trust me. We have Spark in production under Yarn. If you want more control, use Yarn if you can; at least it kills the executor if it hogs memory. I am explicitly setting spark.yarn.executor.memoryOverhead to the same size as the heap for one of our processes. For example:

Running multiple batch jobs in parallel using Spark on Mesos

2015-08-03 Thread Akash Mishra
Hello *, We are trying to build some batch jobs using Spark on Mesos. Mesos offers two main modes of deploying a Spark job: 1. Fine-grained 2. Coarse-grained When we run the Spark jobs in fine-grained mode, Spark uses the maximum amount of offers from Mesos and runs the job.

Re: spark cluster setup

2015-08-03 Thread Sonal Goyal
Your master log files will be in the Spark home folder/logs on the master machine. Do they show an error? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015

RE: SparkLauncher not notified about finished job - hangs infinitely.

2015-08-03 Thread Tomasz Guziałek
Reading from the input stream and the error stream (in separate threads) indeed unblocked the launcher and it exited properly. Thanks for your responses! Best regards, Tomasz From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, July 31, 2015 19:20 To: Elkhan Dadashov Cc: Tomasz Guziałek;

Re: Checkpoint file not found

2015-08-03 Thread Tathagata Das
Can you tell us more about the streaming app? Which DStream operations are you using? On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na...@gmail.com wrote: Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I'm getting the following exception. I'm sure that no other

Is it possible to disable AM page proxy in Yarn client mode?

2015-08-03 Thread Rex Xiong
In Yarn client mode, the Spark driver URL will be redirected to the Yarn web proxy server, but I don't want to use this dynamic name. Is it possible to still use host:port as in standalone mode?

How do I Process Streams that span multiple lines?

2015-08-03 Thread Spark Enthusiast
All examples of Spark Streaming programming that I see assume streams of lines that are then tokenised and acted upon (like the WordCount example). How do I process streams that span multiple lines? Are there examples that I can use?

EOFException when transmitting a class that extends Externalizable

2015-08-03 Thread Michael Knapp
Hi, I am having a problem serializing a custom partitioner that I have written that extends Externalizable. The partitioner wraps a java TreeSet which stores table splits. There are thousands of splits. I noticed earlier that my spark job was taking over 30 seconds just to transmit a task to

Re: Standalone Cluster Local Authentication

2015-08-03 Thread Ted Yu
Looks like related work is in progress. e.g. SPARK-5158 Cheers On Mon, Aug 3, 2015 at 10:05 AM, MrJew kouz...@gmail.com wrote: Hello, Similar to other cluster systems e.g Zookeeper, Hazelcast. Spark has the problem that is protected from the outside world however anyone having access to

Re: How do I Process Streams that span multiple lines?

2015-08-03 Thread Michal Čizmazia
Are you looking for RDD.wholeTextFiles? On 3 August 2015 at 10:57, Spark Enthusiast sparkenthusi...@yahoo.in wrote: All examples of Spark Stream programming that I see assume streams of lines that are then tokenised and acted upon (like the WordCount example). How do I process Streams that

Re: How do I Process Streams that span multiple lines?

2015-08-03 Thread Michal Čizmazia
Sorry, SparkContext.wholeTextFiles. Not sure about streams. On 3 August 2015 at 14:50, Michal Čizmazia mici...@gmail.com wrote: Are you looking for RDD.wholeTextFiles? On 3 August 2015 at 10:57, Spark Enthusiast sparkenthusi...@yahoo.in wrote: All examples of Spark Stream programming
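
For the batch case, a sketch (the SparkContext sc, the path, and the blank-line delimiter are assumptions): wholeTextFiles yields one (path, fullContent) record per file, so a record is free to span lines.

    val files = sc.wholeTextFiles("hdfs:///logs/*.txt")   // RDD[(String, String)]
    val records = files.flatMap { case (_, content) =>
      content.split("\n\n")   // one multi-line record per blank-line-separated block
    }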

Standalone Cluster Local Authentication

2015-08-03 Thread MrJew
Hello, Similar to other cluster systems, e.g. Zookeeper and Hazelcast, Spark has the problem that, while it is protected from the outside world, anyone having access to the host can run a Spark node without the need for authentication. Currently we are using Spark 1.3.1. Is there a way to enable

Re: Is it possible to disable AM page proxy in Yarn client mode?

2015-08-03 Thread Steve Loughran
the reason that redirect is there is for security reasons; in a kerberos enabled cluster the RM proxy does the authentication, then forwards the requests to the running application. There's no obvious way to disable it in the spark application master, and I wouldn't recommend doing this anyway,

Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-03 Thread Ted Yu
When I tried to compile against hbase 1.1.1, I got: [ERROR] /home/hbase/ssoh/src/main/scala/org/apache/spark/sql/hbase/SparkSqlRegionObserver.scala:124: overloaded method next needs result type [ERROR] override def next(result: java.util.List[Cell], limit: Int) = next(result) Is there a plan to

Does RDD.cartesian involve shuffling?

2015-08-03 Thread Meihua Wu
Does RDD.cartesian involve shuffling? Thanks!
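
One way to check in spark-shell (a sketch): the lineage printed by toDebugString shows a CartesianRDD but no ShuffledRDD; tasks fetch the parent partitions they need rather than going through a shuffle.

    val a = sc.parallelize(1 to 4, 2)
    val b = sc.parallelize(5 to 8, 2)
    println(a.cartesian(b).toDebugString)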

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Sujit Pal
@Silvio: the mapPartitions instantiates a HttpSolrServer, then for each query string in the partition, sends the query to Solr using SolrJ, and gets back the top N results. It then reformats the result data into one long string and returns the key value pair as (query string, result string).

Re: Standalone Cluster Local Authentication

2015-08-03 Thread Steve Loughran
On 3 Aug 2015, at 10:05, MrJew kouz...@gmail.com wrote: Hello, Similar to other cluster systems e.g Zookeeper, Actually, Zookeeper supports SASL authentication of your Kerberos tokens. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zookeeper+and+SASL Hazelcast. Spark has the

Combine code for RDD and DStream

2015-08-03 Thread Sidd S
Hello! I am developing a Spark program that uses both batch and streaming (separately). They are both pretty much the exact same program, except that the inputs come from different sources. Unfortunately, RDDs and DStreams define all of their transformations in their own files, and so I have two

Re: Extremely poor predictive performance with RF in mllib

2015-08-03 Thread Barak Gitsis
hi, I've run into some poor RF behavior, although not as pronounced as yours... would be great to get more insight into this one. Thanks! On Mon, Aug 3, 2015 at 8:21 AM pkphlam pkph...@gmail.com wrote: Hi, This might be a long shot, but has anybody run into very poor predictive performance

Re: About memory leak in spark 1.4.1

2015-08-03 Thread Igor Berman
in general, what is your configuration? Use --conf spark.logConf=true. We have 1.4.1 in a production standalone cluster and haven't experienced what you are describing. Can you verify in the web UI that Spark indeed got your 50g per executor limit? I mean on the configuration page... might be you are using

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit

2015-08-03 Thread Rajeshkumar J
Hi Everyone, I have been using Apache Spark for 2 weeks and as of now I am querying hive tables using the spark java api. It works fine in Hadoop single-node mode, but when I tried the same code on a Hadoop multi-node cluster it throws org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't

Fwd: org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit

2015-08-03 Thread Rajeshkumar J
Hi Everyone, I have been using Apache Spark for 2 weeks and as of now I am querying hive tables using the spark java api. It works fine in Hadoop single-node mode, but when I tried the same code on a Hadoop multi-node cluster it throws org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't

Re: spark streaming program failed on Spark 1.4.1

2015-08-03 Thread Cody Koeninger
Just to be clear, did you rebuild your job against spark 1.4.1 as well as upgrading the cluster? On Mon, Aug 3, 2015 at 8:36 AM, Netwaver wanglong_...@163.com wrote: Hi All, I have a spark streaming + kafka program written by Scala, it works well on Spark 1.3.1, but after I migrate
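
A build-file sketch of what "rebuild against 1.4.1" means in sbt (the artifact list is illustrative, assuming a Scala 2.10 build):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.4.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.4.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1"
    )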

spark streaming program failed on Spark 1.4.1

2015-08-03 Thread Netwaver
Hi All, I have a spark streaming + kafka program written in Scala. It works well on Spark 1.3.1, but after I migrated my Spark cluster to 1.4.1 and reran this program, I met the below exception: ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Ajay Singal
Hi Sujit, From experimenting with Spark (and other documentation), my understanding is as follows:
1. Each application consists of one or more Jobs
2. Each Job has one or more Stages
3. Each Stage creates one or more Tasks (normally, one Task per Partition)
4. Master

Re: how to ignore MatchError then processing a large json file in spark-sql

2015-08-03 Thread Michael Armbrust
This sounds like a bug. What version of Spark? And can you provide the stack trace? On Sun, Aug 2, 2015 at 11:27 AM, fuellee lee lifuyu198...@gmail.com wrote: I'm trying to process a bunch of large json log files with spark, but it fails every time with `scala.MatchError`, whether I give it

Re: Combine code for RDD and DStream

2015-08-03 Thread Sidd S
DStream's transform function helped me solve this issue elegantly. Thanks! On Mon, Aug 3, 2015 at 1:42 PM, Sidd S ssinga...@gmail.com wrote: Hello! I am developing a Spark program that uses both batch and streaming (separately). They are both pretty much the exact same programs, except the
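
The pattern, sketched (sc and an assumed DStream[String] named lines): write the logic once against RDDs, then reuse it for streaming via transform.

    import org.apache.spark.rdd.RDD

    def pipeline(rdd: RDD[String]): RDD[(String, Int)] =
      rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    val batchResult  = pipeline(sc.textFile("hdfs:///input"))  // batch
    val streamResult = lines.transform(pipeline _)             // streaming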

Re: how to convert a sequence of TimeStamp to a dataframe

2015-08-03 Thread Michael Armbrust
In general it needs to be a Seq of Tuples for the implicit toDF to work (which is a little tricky when there is only one column). scala> Seq(Tuple1(new java.sql.Timestamp(System.currentTimeMillis))).toDF("a") res3: org.apache.spark.sql.DataFrame = [a: timestamp] or with multiple columns scala>

Writing to HDFS

2015-08-03 Thread Jasleen Kaur
I am executing a spark job on a cluster as a yarn-client (Yarn cluster not an option due to permission issues).
- num-executors 800
- spark.akka.frameSize=1024
- spark.default.parallelism=25600
- driver-memory=4G
- executor-memory=32G
- My input size is around 1.5TB.
My problem

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread shahid ashraf
hi sujit, Can you spin it with 4 (servers) * 4 (cores) = 16 cores, i.e. there should be 16 cores in your cluster; try to use the same no. of partitions. Also look at http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-td23824.html On Tue, Aug 4, 2015 at 1:46 AM, Ajay
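
The suggestion as code (queries is the assumed input RDD; 4 servers x 4 cores from the thread):

    val totalCores = 4 * 4
    val balanced = queries.repartition(totalCores)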

Re: Python, Spark and HBase

2015-08-03 Thread ericbless
I wanted to confirm whether this is now supported, such as in Spark v1.3.0. I've read varying info online and just thought I'd verify. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p24117.html Sent from the Apache Spark