Re: java.io.IOException: Filesystem closed

2014-12-02 Thread rapelly kartheek
Sorry for the delayed response. Please find my application attached. On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What is the application that you are submitting? Looks like you might have invoked fs inside the app and then closed it within it. Thanks Best

Re: java.io.IOException: Filesystem closed

2014-12-02 Thread Akhil Das
Your code seems to have a lot of threads and I think you might be invoking sc.stop before those threads get finished. Thanks Best Regards On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What is the application that you are submitting? Looks like you might have
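
A minimal sketch of the pattern described above, assuming an existing SparkContext sc and application-spawned worker threads (the thread bodies and paths are hypothetical):

    // Hypothetical worker threads that each run a Spark job against the shared SparkContext.
    val workers = (1 to 4).map { i =>
      new Thread(new Runnable {
        def run(): Unit = sc.textFile(s"hdfs:///input/part-$i").count()
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())  // wait for every in-flight job to finish...
    sc.stop()                  // ...before stopping the context (which also closes its FileSystem)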

Re: java.io.IOException: Filesystem closed

2014-12-02 Thread rapelly kartheek
But, somehow, if I run this application for the second time, I find that the application gets executed and the results are out regardless of the same errors in logs. On Tue, Dec 2, 2014 at 2:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Your code seems to have a lot of threads and i think

Re: java.io.IOException: Filesystem closed

2014-12-02 Thread Akhil Das
It could be because those threads are finishing quickly. Thanks Best Regards On Tue, Dec 2, 2014 at 2:19 PM, rapelly kartheek kartheek.m...@gmail.com wrote: But, somehow, if I run this application for the second time, I find that the application gets executed and the results are out

Re: java.io.IOException: Filesystem closed

2014-12-02 Thread rapelly kartheek
Does the SparkContext shut itself down by default even if I don't mention it specifically in my code? Because I ran the application without sc.context(), and I still get the filesystem closed error along with the correct output. On Tue, Dec 2, 2014 at 2:20 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It

Re: Is Spark the right tool for me?

2014-12-02 Thread andy petrella
Point 4 looks weird to me, I mean if you intend to have such a workflow run in a single session (maybe consider a sessionless architecture). Is such a process run for each user? If that's the case, maybe finding a way to do it for all users at once would be better (more data but less scheduling). For the micro

Is it possible to just change the value of the items in RDD without making a full copy?

2014-12-02 Thread Xuelin Cao
Hi, I'd like to perform an operation on an RDD that ONLY changes the value of some items, without making a full copy or a full scan of the data. It is useful when I need to handle a large RDD, and each time I need to change only a small fraction of the data and keep the other data

Re: Is Spark the right tool for me?

2014-12-02 Thread Stadin, Benjamin
To be precise I want the workflow to be associated to a user, but it doesn’t need to be run as part of or depend on a session. I can’t run scheduled jobs, because a user can potentially upload hundreds of files which trigger a long running batch import / update process but he could also make a

Re: Is Spark the right tool for me?

2014-12-02 Thread andy petrella
You might also have to check the Spark JobServer, it could help you at some point. On Tue Dec 02 2014 at 12:29:01 PM Stadin, Benjamin benjamin.sta...@heidelberg-mobil.com wrote: To be precise I want the workflow to be associated to a user, but it doesn’t need to be run as part of or depend on

Re: Is it possible to just change the value of the items in RDD without making a full copy?

2014-12-02 Thread Akhil Das
RDDs are immutable, so if you want to change the value of an RDD then you have to create another RDD from it by applying some transformation. Not sure if this is what you are looking for: val rdd = sc.parallelize(Range(0,100)) val rdd2 = rdd.map(x => { println("Value : " + x)

Re: Is it possible to just change the value of the items in RDD without making a full copy?

2014-12-02 Thread Yanbo Liang
You cannot modify an RDD in mapPartitions because RDDs are immutable. Once you apply a transformation to an RDD, it will produce a new RDD. If you just want to modify only a fraction of the total RDD, try to collect the new value list to the driver or use a broadcast variable after each iteration, not

Re: Is it possible to just change the value of the items in RDD without making a full copy?

2014-12-02 Thread Sean Owen
Although it feels like you are copying an RDD when you map it, it is not necessarily literally being copied. Your map function may pass through most objects unchanged. So there may not be so much overhead as you think. I don't think you can avoid a scan of the data unless you can somehow know
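
A minimal sketch of that point, assuming an existing SparkContext sc; the predicate and update below are illustrative. Mapping produces a new RDD, but elements that don't match are passed through as the same objects, so only the changed fraction is actually rebuilt:

    case class Item(id: Int, value: Double)
    val rdd = sc.parallelize((1 to 1000000).map(i => Item(i, 0.0)))
    // Only the small fraction matching the predicate is rebuilt; other items pass through as-is.
    val updated = rdd.map(item => if (item.id % 1000 == 0) item.copy(value = 1.0) else item)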

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-02 Thread Alexey Romanchuk
Any ideas? Anyone got the same error? On Mon, Dec 1, 2014 at 2:37 PM, Alexey Romanchuk alexey.romanc...@gmail.com wrote: Hello spark users! I found lots of strange messages in driver log. Here it is: 2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25] ERROR

Parallelize independent tasks

2014-12-02 Thread Anselme Vignon
Hi folks, We have written a Spark job that scans multiple HDFS directories and performs transformations on them. For now, this is done with a simple for loop that starts one task at each iteration. It looks like: dirs.foreach { case (src, dest) => sc.textFile(src).process.saveAsFile(dest) }

Does filter on an RDD scan every data item ?

2014-12-02 Thread nsareen
Hi, I wanted some clarity on the functioning of the filter function of an RDD. 1) Does the filter function scan every element saved in the RDD? If my RDD represents 10 million rows, and I want to work on only 1000 of them, is there an efficient way of filtering the subset without having to scan every

pySpark saveAsSequenceFile append overwrite

2014-12-02 Thread Csaba Ragany
Dear Spark community, Does the pySpark saveAsSequenceFile(folder) method have the ability to append the new sequence file to another one, or to overwrite an existing sequence file? If the folder already exists then I get an error message... Thank you! Csaba

IP to geo information in spark streaming application

2014-12-02 Thread Noam Kfir
Hi, I'm new to Spark Streaming. I'm currently writing a Spark Streaming application to standardize events coming from Kinesis. As part of the logic, I want to use an IP-to-geo information library or service. My questions: 1) If I were to use some REST service for this task, do you think it would

Re: Does filter on an RDD scan every data item ?

2014-12-02 Thread Gen
Hi, For your first question, I think that we can use sc.parallelize(rdd.take(1000)). For your second question, I am not sure, but I don't think that we can restrict filter to certain partitions without scanning every element. Cheers Gen nsareen wrote Hi, I wanted some clarity on the
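
A sketch of both suggestions, assuming an existing SparkContext sc and an RDD of integers: filter evaluates its predicate on every element, while take(n) only reads as many partitions as needed to gather n elements:

    val rdd = sc.parallelize(1 to 10000000)
    // filter: the predicate runs on every element (a full scan)
    val filtered = rdd.filter(_ % 10000 == 0)
    // take(1000): scans partitions only until 1000 elements are collected, then re-parallelizes them
    val first1000 = sc.parallelize(rdd.take(1000))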

Re: MLlib Naive Bayes classifier confidence

2014-12-02 Thread MariusFS
Are we sure that exponentiating will give us the probabilities? I did some tests by cloning the MLlib class and adding the required code, but the calculated probabilities do not add up to 1. I tried something like: def predictProbs(testData: Vector): (BDV[Double], BDV[Double]) = { val
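
For the exponentiated class scores to add up to 1 they have to be normalized; a sketch using Breeze, where the logPosterior vector stands in for the per-class log scores (an assumption about what the cloned predictProbs computes):

    import breeze.linalg.{DenseVector => BDV, max, sum}
    import breeze.numerics.exp

    def normalize(logPosterior: BDV[Double]): BDV[Double] = {
      val shifted = logPosterior - max(logPosterior)  // subtract the max for numerical stability
      val p = exp(shifted)
      p / sum(p)                                      // probabilities now sum to 1
    }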

Re: Is Spark the right tool for me?

2014-12-02 Thread Roger Hoover
I've also considered using Kafka to message between the Web UI and the pipes; I think it will fit. Chaining the pipes together as a workflow and implementing, managing and monitoring these long-running user tasks with locality as I need them is still causing me a headache. You can look at Apache

Re: IP to geo information in spark streaming application

2014-12-02 Thread Akhil Das
1. If you use some custom API library, there's a chance of ending up with serialization errors and the like, but a normal HTTP REST API would work fine, except there could be a bit of performance lag and those APIs might limit the number of requests. 2. I would go for this approach: either I will broadcast
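
A sketch of the broadcast approach, assuming an existing SparkContext sc and using a hypothetical in-memory prefix table in place of a real GeoIP database; the same mapPartitions pattern applies to a DStream in a streaming job:

    case class Event(ip: String, payload: String)  // hypothetical event type

    val events = sc.parallelize(Seq(Event("54.23.1.9", "..."), Event("10.0.0.3", "...")))
    // Hypothetical prefix -> country table; a real GeoIP lookup structure would replace this.
    val geoTable = sc.broadcast(Map("54.23." -> "US", "10.0." -> "internal"))

    val enriched = events.mapPartitions { iter =>
      iter.map { e =>
        val country = geoTable.value
          .collectFirst { case (prefix, c) if e.ip.startsWith(prefix) => c }
          .getOrElse("unknown")
        (e, country)
      }
    }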

Re: Parallelize independent tasks

2014-12-02 Thread Victor Tso-Guillen
dirs.par.foreach { case (src, dest) => sc.textFile(src).process.saveAsFile(dest) } Is that sufficient for you? On Tuesday, December 2, 2014, Anselme Vignon anselme.vig...@flaminem.com wrote: Hi folks, We have written a spark job that scans multiple hdfs directories and perform
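
A self-contained variant of the same idea using Futures instead of a parallel collection, assuming an existing SparkContext sc; the .process.saveAsFile call from the original post is stood in for by a simple map and saveAsTextFile:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Hypothetical (src, dest) pairs; each pair becomes an independent Spark job.
    val dirs = Seq(("hdfs:///in/a", "hdfs:///out/a"), ("hdfs:///in/b", "hdfs:///out/b"))

    // Submitting the jobs from separate threads lets the scheduler run them concurrently.
    val jobs = dirs.map { case (src, dest) =>
      Future { sc.textFile(src).map(_.trim).saveAsTextFile(dest) }
    }
    jobs.foreach(Await.result(_, Duration.Inf))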

sort algorithm using sortBy

2014-12-02 Thread bchazalet
I am trying to understand the sort algorithm that is used in RDD#sortBy. I have read that post from Matei http://apache-spark-user-list.1001560.n3.nabble.com/Complexity-Efficiency-of-SortByKey-tp14328p14332.html and that helps a little bit already. I'd like to further understand the

Re: Loading RDDs in a streaming fashion

2014-12-02 Thread Ashish Rangole
This is a common use case, and it is how the Hadoop APIs for reading data work: they return an Iterator[YourRecord] instead of reading every record in at once. On Dec 1, 2014 9:43 PM, Andy Twigg andy.tw...@gmail.com wrote: You may be able to construct RDDs directly from an iterator - not sure -
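
A small illustration of that lazy-iterator behaviour, assuming an existing SparkContext sc (path and parsing are illustrative): inside mapPartitions the records of a partition arrive as an Iterator and are processed one at a time rather than materialized in memory:

    val parsed = sc.textFile("hdfs:///big/input").mapPartitions { lines =>
      // 'lines' is a lazy Iterator[String]; mapping it does not pull the whole partition into memory
      lines.map(_.split('\t'))
    }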

Re: Low Level Kafka Consumer for Spark

2014-12-02 Thread RodrigoB
Hi Dibyendu, What are your thoughts on keeping this solution (or not), considering that Spark Streaming v1.2 will have built-in recoverability of the received data? https://issues.apache.org/jira/browse/SPARK-1647 I'm concerned about the complexity of this solution with regard to the added complexity

Re: Spark setup on local windows machine

2014-12-02 Thread Sunita Arvind
Thanks Sameer and Akhil for your help. I tried both your suggestions; however, I still face the same issue. There was indeed a space in the installation path for Scala and sbt, since I had kept the defaults and hence the path was C:\Program Files. I reinstalled Scala and sbt in C:\ as well

Re: Negative Accumulators

2014-12-02 Thread Peter Thai
Similarly, I'm having an issue with the above solution when I use the math.min() function to add to an accumulator. I'm seeing negative overflow numbers again. This code works fine without the math.min() and even if I add an arbitrarily large number like 100 // doesn't work someRDD.foreach(x={

Scala Dependency Injection

2014-12-02 Thread Venkat Subramanian
This is a more of a Scala question than Spark question. Which Dependency Injection framework do you guys use for Scala when using Spark? Is http://scaldi.org/ recommended? Regards Venkat -- View this message in context:

Help understanding - Not enough space to cache rdd

2014-12-02 Thread akhandeshi
I am running in local mode on a Google n1-highmem-16 (16 vCPU, 104 GB memory) machine. I have allocated SPARK_DRIVER_MEMORY=95g. I see Memory: 33.6 GB Used (73.7 GB Total) for the executor. In the log output below, I see 33.6 GB of blocks are used by 2 RDDs that I have cached.

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-12-02 Thread Judy Nash
Any suggestion on how a user with a custom Hadoop jar can solve this issue? -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Sunday, November 30, 2014 11:06 PM To: Judy Nash Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-12-02 Thread Marcelo Vanzin
On Tue, Dec 2, 2014 at 11:22 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Any suggestion on how a user with a custom Hadoop jar can solve this issue? You'll need to include all the dependencies for that custom Hadoop jar on the classpath. Those will include Guava (which is not included in

SchemaRDD + SQL , loading projection columns

2014-12-02 Thread Vishnusaran Ramaswamy
Hi, I have 16 GB of Parquet files in the /tmp/logs/ folder with the following schema: request_id (String), module (String), payload (Array[Byte]). Most of my 16 GB of data is in the payload field; the request_id and module fields take less than 200 MB. I want to load the payload only when my filter
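
Because Parquet is a columnar format, selecting only the small columns should avoid reading the payload bytes at all; a sketch against the Spark 1.1/1.2 API, assuming an existing SparkContext sc and an illustrative filter value:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val logs = sqlContext.parquetFile("/tmp/logs")  // SchemaRDD over the Parquet files
    logs.registerTempTable("logs")
    // Only the request_id and module columns are read from disk; payload is never touched.
    val small = sqlContext.sql("SELECT request_id, module FROM logs WHERE module = 'billing'")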

Unresolved attributes

2014-12-02 Thread Eric Tanner
I am running Spark 1.1.0 on DSE Cassandra 4.6. When I try to run the following SQL statement: val sstring = "Select * from seasonality where customer_id = " + customer_id + " and cat_id = " + seg + " and period_desc = " + cDate; println("sstring = " + sstring); val rrCheckRdd =

WordCount fails in .textFile() method

2014-12-02 Thread Rahul Swaminathan
Hi, I am trying to run JavaWordCount without using the spark-submit script. I have copied the source code for JavaWordCount and am using a JavaSparkContext with the following: SparkConf conf = new SparkConf().setAppName("JavaWordCount");

SaveAsTextFile brings down data nodes with IO Exceptions

2014-12-02 Thread Ganelin, Ilya
Hi all, as the last stage of execution, I am writing out a dataset to disk. Before I do this, I force the DAG to resolve so this is the only job left in the pipeline. The dataset in question is not especially large (a few gigabytes). During this step, however, HDFS will inevitably crash. I will

Using SparkSQL to query Hbase entity takes very long time

2014-12-02 Thread bonnahu
Hi all, I am new to Spark and currently I am trying to run a SparkSQL query on HBase entity. For an entity with about 4000 rows, it will take about 12 seconds. Is it expected? Is there any way to shorten the query process? Here is the code snippet: SparkConf sparkConf = new

Announcing Spark 1.1.1!

2014-12-02 Thread Andrew Or
I am happy to announce the availability of Spark 1.1.1! This is a maintenance release with many bug fixes, most of which are concentrated in the core. This list includes various fixes to sort-based shuffle, memory leak, and spilling issues. Contributions from this release came from 55 developers.

Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Hello all, Is there a way to load an RDD in a small driver app and connect with a JDBC client and issue SQL queries against it? It seems the thrift server only works with pre-existing Hive tables. Thanks Jim -- View this message in context:

executor logging management from python

2014-12-02 Thread freedafeng
Hi, wondering if anyone could help with this. We use ec2 cluster to run spark apps in standalone mode. The default log info goes to /$spark_folder/work/. This folder is in the 10G root fs. So it won't take long to fill up the whole fs. My goal is 1. move the logging location to /mnt, where we

Re: Negative Accumulators

2014-12-02 Thread Peter Thai
To answer my own question, I was declaring the accumulator incorrectly. The code should look like this: scala> import org.apache.spark.AccumulatorParam import org.apache.spark.AccumulatorParam scala> :paste // Entering paste mode (ctrl-D to finish) implicit object BigIntAccumulatorParam extends
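
A sketch of how that definition likely continues (the increment and someRDD are as in the earlier message); BigInt sidesteps the integer overflow that produces the negative values:

    import org.apache.spark.AccumulatorParam

    implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
      def zero(initialValue: BigInt): BigInt = BigInt(0)
      def addInPlace(r1: BigInt, r2: BigInt): BigInt = r1 + r2
    }

    val acc = sc.accumulator(BigInt(0))     // picks up the implicit param above
    someRDD.foreach(x => acc += BigInt(1))  // someRDD as in the original post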

RE: Calling spark from a java web application.

2014-12-02 Thread Mohammed Guller
Jamal, I have not tried this, but can you not integrate Spark SQL with your Spring Java web app just like a standalone app? I have integrated a Scala web app (using Play) with Spark SQL and it works. Mohammed From: adrian [mailto:adria...@gmail.com] Sent: Friday, November 28, 2014 11:03 AM To:

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Michael Armbrust
There is an experimental method that allows you to start the JDBC server with an existing HiveContext (which might have registered temporary tables). https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
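
A sketch of how startWithContext might be used from a driver or spark-shell session, assuming an existing SparkContext sc; the table name and schema are made up, and the method is marked experimental as of Spark 1.2:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    case class Record(id: String, value: Int)  // hypothetical schema

    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD         // implicit RDD -> SchemaRDD (Spark 1.1/1.2)

    sc.parallelize(Seq(Record("a", 1), Record("b", 2))).registerTempTable("records")
    HiveThriftServer2.startWithContext(hiveContext)  // JDBC clients can now query "records"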

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Thanks! I'll give it a try. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-tp20197p20202.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unresolved attributes

2014-12-02 Thread Michael Armbrust
A little bit about how to read this output. Resolution occurs from the bottom up, and when you see a tick (') it means that a field is unresolved. So here it looks like Year_2011_Month_0_Week_0_Site is missing from your RDD. (We are working on more obvious error messages.) Michael On Tue,
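
Given how the query string in the earlier message is concatenated, one plausible cause is that string values end up unquoted in the SQL text and get parsed as column names; a sketch of quoting them, assuming cat_id and period_desc are string columns and customer_id is numeric:

    // Single-quote string values in the generated SQL, otherwise they are parsed as identifiers.
    val sstring = s"SELECT * FROM seasonality WHERE customer_id = $customer_id " +
      s"AND cat_id = '$seg' AND period_desc = '$cDate'"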

Re: ALS failure with size Integer.MAX_VALUE

2014-12-02 Thread Xiangrui Meng
Hi Bharath, You can try setting a smaller number of item blocks in this case; 1200 is definitely too large for ALS. Please try 30 or even smaller. I'm not sure whether this will solve the problem because you have 100 items connected with 10^8 users. There is a JIRA for this issue:
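
For reference, one of the ALS.train overloads takes the number of blocks as its last argument; a sketch with illustrative rank, iteration, and lambda values (ratings is an RDD[Rating] built elsewhere):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // 30 blocks instead of 1200, as suggested above
    val model = ALS.train(ratings, 10, 10, 0.01, 30)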

object xxx is not a member of package com

2014-12-02 Thread flyson
Hello everyone, Could anybody tell me how to import and call third-party Java classes from inside Spark? Here's my case: I have a jar file (the directory layout is com.xxx.yyy.zzz) which contains some Java classes, and I need to call some of them in Spark code. I used the statement import

Re: Any ideas why a few tasks would stall

2014-12-02 Thread Sameer Farooqui
Have you tried taking thread dumps via the UI? There is a link to do so on the Executors page (typically under http://<driver IP>:4040/executors). By visualizing the thread call stack of the executors with slow-running tasks, you can see exactly what code is executing at an instant in time. If you

Re: Kryo NPE with Array

2014-12-02 Thread Simone Franzini
I finally solved this issue. The problem was that: 1. I defined a case class with a Buffer[MyType] field. 2. I instantiated the class with the field set to the value given by an implicit conversion from a Java list, which is supposedly a Buffer. 3. However, the underlying type of that field was

Reading from Kerberos Secured HDFS in Spark?

2014-12-02 Thread Matt Cheah
Hi everyone, I've been trying to set up Spark so that it can read data from HDFS, when the HDFS cluster is integrated with Kerberos authentication. I've been using the Spark shell to attempt to read from HDFS, in local mode. I've set all of the appropriate properties in core-site.xml and

Re: executor logging management from python

2014-12-02 Thread freedafeng
cat spark-env.sh -- #!/usr/bin/env bash export SPARK_WORKER_OPTS=-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily -Dspark.executor.logs.rolling.maxRetainedFiles=3 export SPARK_LOCAL_DIRS=/mnt/spark export SPARK_WORKER_DIR=/mnt/spark -- But

Re: Any ideas why a few tasks would stall

2014-12-02 Thread Steve Lewis
1) I can go there but none of the links are clickable. 2) When I see something like 116/120 partitions succeeded in the stages UI, in the storage UI I see (NOTE) RDD 27 has 116 partitions cached - 4 not - and those are exactly the number of machines which will not complete. Also, RDD 27 does not show up

Re: Calling spark from a java web application.

2014-12-02 Thread nsareen
We have a web application which talks to a Spark server. This is how we have done the integration: 1) In Tomcat's classpath, add the Spark distribution jar so that Spark code is available at runtime (for you it would be Jetty). 2) In the web application project, add the Spark distribution jar
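
A rough sketch of what creating the context from inside the web application might look like; the master URL and jar path are made up, and setJars ships the application's job classes to the executors:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("webapp-spark")
      .setMaster("spark://spark-master:7077")          // hypothetical standalone master URL
      .setJars(Seq("/opt/webapp/lib/spark-jobs.jar"))  // hypothetical jar containing the job classes
    val sc = new SparkContext(conf)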

Re: Problem creating EC2 cluster using spark-ec2

2014-12-02 Thread Nicholas Chammas
Interesting. Do you have any problems when launching in us-east-1? What is the full output of spark-ec2 when launching a cluster? (Post it to a gist if it’s too big for email.) ​ On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis dave.chal...@aistemos.com wrote: I've been trying to create a Spark

Re: Problem creating EC2 cluster using spark-ec2

2014-12-02 Thread Shivaram Venkataraman
+Andrew Actually I think this is because we haven't uploaded the Spark binaries to cloudfront / pushed the change to mesos/spark-ec2. Andrew, can you take care of this ? On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Interesting. Do you have any problems

RE: Low Level Kafka Consumer for Spark

2014-12-02 Thread Shao, Saisai
Hi Rod, The purpose of introducing the WAL mechanism in Spark Streaming as a general solution is to let all the receivers benefit from this mechanism. Though, as you said, external sources like Kafka have their own checkpoint mechanism; instead of storing data in the WAL, we can only store

Re: IP to geo information in spark streaming application

2014-12-02 Thread qinwei
1) I think using a library-based solution is a better idea; we used that, and it works. 2) We used a broadcast variable, and it works. qinwei From: Noam Kfir Date: 2014-12-02 23:14 To: user@spark.apache.org Subject: IP to geo information in spark streaming application Hi I'm new to

Re: Spark SQL table Join, one task is taking long

2014-12-02 Thread Venkat Subramanian
Bump up. Michael Armbrust, anybody from Spark SQL team? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-table-Join-one-task-is-taking-long-tp20124p20218.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Viewing web UI after fact

2014-12-02 Thread lihu
How did you solve this problem? I ran a standalone application, but there is no APPLICATION_COMPLETE file either. On Sat, Nov 8, 2014 at 2:11 PM, Arun Ahuja aahuj...@gmail.com wrote: We are running our applications through YARN and are only sometimes seeing them in the History Server. Most do

Monitoring Spark

2014-12-02 Thread Isca Harmatz
Hello, I'm running Spark on a cluster and I want to monitor how many nodes/cores are active at different (specific) points of the program. Is there any way to do this? Thanks, Isca

Re: Monitoring Spark

2014-12-02 Thread Otis Gospodnetic
Hi Isca, I think SPM can do that for you: http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Dec 2, 2014 at 11:57 PM, Isca Harmatz