Slow Mongo Read from Spark

2015-08-31 Thread Deepesh Maheshwari
Hi, I am trying to read MongoDB in Spark using newAPIHadoopRDD. /* Code */ config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat"); config.set("mongo.input.uri",SparkProperties.MONGO_OUTPUT_URI); config.set("mongo.input.query","{host: 'abc.com'}"); JavaSparkContext sc=new
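
A minimal Scala sketch of the same read path, assuming a live SparkContext `sc` and the mongo-hadoop connector on the classpath (the URI is illustrative):

    import org.apache.hadoop.conf.Configuration
    import com.mongodb.hadoop.MongoInputFormat
    import org.bson.BSONObject

    val config = new Configuration()
    config.set("mongo.input.uri", "mongodb://localhost:27017/db.collection") // illustrative URI
    config.set("mongo.input.query", "{host: 'abc.com'}")

    // keys are the document _id, values the BSON document
    val rdd = sc.newAPIHadoopRDD(config, classOf[MongoInputFormat],
      classOf[Object], classOf[BSONObject])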

Re: Help Explain Tasks in WebUI:4040

2015-08-31 Thread Akhil Das
Are you doing a join/groupBy type of operation? In that case I would suspect that the keys are not evenly distributed and that's why a few of the tasks are spending way too much time doing the actual processing. You might want to look into custom partitioners
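
A sketch of what such a custom partitioner could look like, routing one known hot key to its own partition (class and key names are illustrative):

    import org.apache.spark.Partitioner

    // deterministic: the same key always lands in the same partition
    class SkewAwarePartitioner(partitions: Int, hotKey: String) extends Partitioner {
      require(partitions >= 2)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        if (key == hotKey) 0 // isolate the hot key in its own partition
        else 1 + math.abs(key.hashCode % (partitions - 1))
    }

It would then be passed to the shuffle, e.g. pairs.groupByKey(new SkewAwarePartitioner(200, "abc.com")).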

Distance Calculation in Spark K means clustering

2015-08-31 Thread ashensw
Hi all, I am currently working on a K-means clustering project. I want to get the distances of each data point to its cluster center after building the K-means model. Currently I get the cluster centers of each data point by sending the JavaRDD which includes all the data points to K means
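
A sketch of computing each point's distance to its assigned center with MLlib, assuming a trained `model: KMeansModel` and `points: RDD[Vector]`:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    val distances = points.map { p =>
      val center = model.clusterCenters(model.predict(p))
      math.sqrt(Vectors.sqdist(p, center)) // Euclidean distance to the assigned center
    }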

Re: Help Explain Tasks in WebUI:4040

2015-08-31 Thread Igor Berman
Are there other processes on sk3? Or, more generally, are you sharing resources with somebody else, virtualization, etc.? Does your transformation consume other services? (e.g. reading from S3, so it can happen that S3 latency plays a role...) Can it be that a task for some key will take longer than

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread Akhil Das
Check Spark Jobserver Thanks Best Regards On Mon, Aug 31, 2015 at 8:54 AM, canan chen wrote: > I found the rest api for submitting spark job, but did not found the > document. Could any one point it to me ? Thanks >

Re: spark-submit issue

2015-08-31 Thread Igor Berman
It might be that you need to drain the stdout/stderr of the subprocess... otherwise the subprocess can deadlock http://stackoverflow.com/questions/3054531/correct-usage-of-processbuilder On 27 August 2015 at 16:11, pranay wrote: > I have a java program that does this - (using Spark
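
A sketch of the drain pattern, assuming the parent launches spark-submit via ProcessBuilder (command and class names are illustrative):

    import java.io.{BufferedReader, InputStreamReader}

    val pb = new ProcessBuilder("spark-submit", "--class", "my.Main", "app.jar")
    pb.redirectErrorStream(true) // merge stderr into stdout so one reader suffices
    val proc = pb.start()

    // drain continuously; an undrained pipe buffer can fill up and block the child
    val reader = new BufferedReader(new InputStreamReader(proc.getInputStream))
    Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
    val exitCode = proc.waitFor()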

Re: Slow Mongo Read from Spark

2015-08-31 Thread Akhil Das
Can you try with these key/value classes and see the performance? inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat" keyClassName = "org.apache.hadoop.io.Text" valueClassName = "org.apache.hadoop.io.MapWritable" Taken from the Databricks blog

Re: Slow Mongo Read from Spark

2015-08-31 Thread Akhil Das
Here's a piece of code which works well for us (spark 1.4.1) Configuration bsonDataConfig = new Configuration(); bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat"); Configuration predictionsConfig = new Configuration();

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-31 Thread Timothy Sum Hon Mun
Dear Sandy, Many thanks for your reply. I am going to respond to your replies in reverse order if you don't mind as my second question is the more pressing issue for now. In the situation where you give more memory, but less memory overhead, and the job completes less quickly, have you

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-31 Thread timothy22000
Added log files and diagnostics to first and second cases and removed the images. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Effects-of-Driver-Memory-Executor-Memory-Driver-Memory-Overhead-and-Executor-Memory-Overhead-os-tp24507p24528.html Sent

Re: bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
I think I have found out what was causing me difficulties. It seems I was reading too much into the stage description shown in the "Stages" tab of the Spark application UI. While it said "repartition at NativeMethodAccessorImpl.java:-2", I can infer from the network traffic and from its response

Re: Reading xml in java using spark

2015-08-31 Thread Rick Hillegas
Hi Rakesh, You might also take a look at the Derby code. org.apache.derby.vti.XmlVTI provides a number of static methods for turning an XML resource into a JDBC ResultSet. Thanks, -Rick On 8/31/15 4:44 AM, Sonal Goyal wrote: I think the mahout project had an XmlInputFormat which you can

Re: spark-submit issue

2015-08-31 Thread Akhil Das
Can you not use the Spark jobserver instead? Just submit your job to the job server, which already has the SparkContext initialized in it; it would make it much easier, I think. Thanks Best Regards On Mon, Aug 31, 2015 at 2:16 PM, Pranay Tonpay wrote: > thx for the

Re: spark streaming 1.3 kafka topic error

2015-08-31 Thread Cody Koeninger
You can't set it to less than 1. Just set it to max int if that's really what you want to do. On Mon, Aug 31, 2015 at 6:00 AM, Shushant Arora wrote: > Say if my cluster takes long time for rebalance for some reason > intermittently . So to handle that Can I have
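
Assuming the setting in question is the direct stream's spark.streaming.kafka.maxRetries (an assumption on my part), a sketch of "retry effectively forever":

    import org.apache.spark.SparkConf

    // must be >= 1; Int.MaxValue approximates "infinite" retries
    val conf = new SparkConf()
      .set("spark.streaming.kafka.maxRetries", Int.MaxValue.toString)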

start master failed with error

2015-08-31 Thread Garry Chen
Hi All, I get the following error when issuing start-master.sh, even though I have CLASSPATH=/u01/app/jdk1.8.0_60/lib:/u01/app/apache-hive-1.2.1-bin/lib:/u01/app/slf4j-1.7.12 set in my environment profile. Can anyone help? Thank you very much. Garry Spark Command:

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ted Yu
Ashish: Can you post the complete stack trace for NotSerializableException ? Cheers On Mon, Aug 31, 2015 at 8:49 AM, Ashish Shrowty wrote: > bcItemsIdx is just a broadcast variable constructed out of Array[(String)] > .. it holds the item ids and I use it for indexing

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-08-31 Thread Olivier Girardot
Tested now against Spark 1.5.0 RC2, and the same exceptions happen when num-executors > 2: 15/08/25 10:31:10 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 5.0 (TID 501, xxx): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long at

Managing httpcomponent dependency in Spark/Solr

2015-08-31 Thread Oliver Schrenk
Hi, We are running a distributed indexing service for Solr (4.7) on a Spark (1.2) cluster. Now we want to upgrade to Solr 5.3 and are running into problems with dependencies. Solr 5 brings in org.apache.httpcomponents httpclient 4.4.1 (1) and the prebuilt binary for Spark 1.2.2 for CDH 4

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ashish Shrowty
Sure .. here it is (scroll below to see the NotSerializableException). Note that upstream, I do load up the (user,item,ratings) data from a file using ObjectInputStream, do some calculations that I put in a map and then create the rdd used in the code above from that map. I even tried

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ted Yu
I am not familiar with your code. bq. and then create the rdd I assume you call ObjectOutputStream.close() prior to the above step. Cheers On Mon, Aug 31, 2015 at 9:42 AM, Ashish Shrowty wrote: > Sure .. here it is (scroll below to see the NotSerializableException).

Spark executor OOM issue on YARN

2015-08-31 Thread unk1102
Hi, I have a Spark job whose executors hit an OOM issue after some time, and my job hangs because of it, followed by a couple of IOExceptions, "RPC client disassociated", "shuffle not found", etc. I have tried almost everything and don't know how to solve this OOM issue; please guide me, I am fed up now. Here is what I

Re: Spark executor OOM issue on YARN

2015-08-31 Thread Ted Yu
Please see this thread w.r.t. spark.sql.shuffle.partitions : http://search-hadoop.com/m/q3RTtE7JOv1bDJtY FYI On Mon, Aug 31, 2015 at 11:03 AM, unk1102 wrote: > Hi I have Spark job and its executors hits OOM issue after some time and my > job hangs because of it followed

Re: Spark executor OOM issue on YARN

2015-08-31 Thread Umesh Kacha
Hi Ted, thanks. I know spark.sql.shuffle.partitions is 200 by default. It would be great if you could help me solve the OOM issue. On Mon, Aug 31, 2015 at 11:43 PM, Ted Yu wrote: > Please see this thread w.r.t. spark.sql.shuffle.partitions : >
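
A sketch of raising the post-shuffle parallelism so each task handles less data (the value is illustrative; the 1.x default is 200):

    sqlContext.setConf("spark.sql.shuffle.partitions", "1000")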

Potential NPE while exiting spark-shell

2015-08-31 Thread nasokan
I'm currently using Spark 1.3.0 on a YARN cluster deployed through CDH 5.4. My cluster does not have a 'default' queue, and launching 'spark-shell' submits a YARN application that gets killed immediately because the queue does not exist. However, the spark-shell session is still in progress after

Re: How to send RDD result to REST API?

2015-08-31 Thread Cassa L
Hi Ted, My server is expecting JSON. Can I just use HttpClient in the Spark job and push the result of an RDD action to the server? I'm trying to figure out how to achieve this. LCassa On Fri, Aug 28, 2015 at 9:45 PM, Ted Yu wrote: > What format does your REST server expect ? >

Re: Potential NPE while exiting spark-shell

2015-08-31 Thread Akhil Das
Looks like you are hitting https://issues.apache.org/jira/browse/SPARK-5869; try updating your Spark version. Thanks Best Regards On Tue, Sep 1, 2015 at 12:09 AM, nasokan wrote: > I'm currently using Spark 1.3.0 on yarn cluster deployed through CDH5.4. My > cluster

Too many open files issue

2015-08-31 Thread Sigurd Knippenberg
I am running into a 'too many open files' issue, and before I posted this I searched the web to see if anyone had a solution to my particular problem, but I did not see anything that helped. I know I can adjust the max open files allowed by the OS, but I'd rather fix the underlying
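
A sketch of Spark 1.x shuffle settings that reduce the number of files held open at once (values illustrative; whether they apply depends on the shuffle manager in use):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.manager", "sort")          // the default since 1.2; far fewer files than hash
      .set("spark.shuffle.consolidateFiles", "true") // only relevant to the hash shuffle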

Re: start master failed with error

2015-08-31 Thread Akhil Das
slf4j classes would already be present in your spark-assembly jar (do a *jar tf /u01/app/spark-1.4.1-bin-without-hadoop/lib/spark-assembly-1.4.1-hadoop2.2.0.jar | grep "org/slf4j/Logger"*) Thanks Best Regards On Mon, Aug 31, 2015 at 9:32 PM, Garry Chen

Re: How to send RDD result to REST API?

2015-08-31 Thread Ted Yu
You can utilize commons-httpclient and push the result of the RDD to the server. BTW, specify a commons-httpclient version that is consistent with the Hadoop version you use, e.g. the following is brought in along with hadoop 2.6: [INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
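
A sketch of that approach with the commons-httpclient 3.x API, assuming an RDD[String] of JSON payloads and an illustrative endpoint URL:

    import org.apache.commons.httpclient.HttpClient
    import org.apache.commons.httpclient.methods.{PostMethod, StringRequestEntity}

    rdd.foreachPartition { iter =>
      val client = new HttpClient() // one client per partition, on the executor
      iter.foreach { json =>
        val post = new PostMethod("http://server:8080/ingest") // illustrative URL
        post.setRequestEntity(new StringRequestEntity(json, "application/json", "UTF-8"))
        try client.executeMethod(post) finally post.releaseConnection()
      }
    }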

Re: Standalone mode: is SPARK_WORKER_MEMORY per SPARK_WORKER_INSTANCE?

2015-08-31 Thread Akhil Das
It's for each worker instance, I guess. So you should be giving SPARK_WORKER_MEMORY=10g to utilize your 80g on 8 worker instances. Thanks Best Regards On Mon, Aug 31, 2015 at 1:36 PM, Muler wrote: > Hi, > > Is SPARK_WORKER_MEMORY defined per SPARK_WORKER_INSTANCE (just
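
A sketch of the corresponding spark-env.sh, assuming 8 workers sharing one 80g machine (values illustrative):

    # 8 workers x 10g each = 80g total
    export SPARK_WORKER_INSTANCES=8
    export SPARK_WORKER_MEMORY=10g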

Is it possible to create spark cluster in different network?

2015-08-31 Thread sakana
Hi, I have successfully created a Spark cluster in OpenStack. I want to create a Spark cluster across different OpenStack sites. In OpenStack, if you create an instance, it only knows its private IP (like 10.x.y.z); it will not know it has a public IP. (I tried to export

Re: Potential NPE while exiting spark-shell

2015-08-31 Thread Nithin Asokan
SPARK-5869 appears to have the same exception and is fixed in 1.3.0. I double checked the CDH package to see if it had the patch https://github.com/cloudera/spark/blob/cdh5.4.4-release/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L161 In my case, my yarn application fails

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ashish Shrowty
Yes .. I am closing the stream. Not sure what you meant by "bq. and then create rdd"? -Ashish On Mon, Aug 31, 2015 at 1:02 PM Ted Yu wrote: > I am not familiar with your code. > > bq. and then create the rdd > > I assume you call ObjectOutputStream.close() prior to the

Problems with Tungsten in Spark 1.5.0-rc2

2015-08-31 Thread Anders Arpteg
Was trying out 1.5 rc2 and noticed some issues with the Tungsten shuffle manager. One problem was when using the com.databricks.spark.avro reader and the error(1) was received, see stack trace below. The problem does not occur with the "sort" shuffle manager. Another problem was in a large

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Ted Yu
I used the notation on JIRA where bq means quote. FYI On Mon, Aug 31, 2015 at 12:34 PM, Ashish Shrowty wrote: > Yes .. I am closing the stream. > > Not sure what you meant by "bq. and then create rdd"? > > -Ashish > > On Mon, Aug 31, 2015 at 1:02 PM Ted Yu

Exceptions in threads in executor code don't get caught properly

2015-08-31 Thread Wayne Song
We've been running into a situation where exceptions in rdd.map() calls will not get recorded and shown on the web UI properly. We've discovered that this seems to occur because we're creating our own threads in foreachPartition() calls. If I have code like this: The tasks on the executors
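
A sketch of one way to make such failures visible, capturing the spawned thread's exception and rethrowing it on the task thread (the `process` helper is hypothetical):

    import java.util.concurrent.atomic.AtomicReference

    rdd.foreachPartition { iter =>
      val failure = new AtomicReference[Throwable]()
      val t = new Thread(new Runnable {
        override def run(): Unit =
          try process(iter)                            // hypothetical per-partition work
          catch { case e: Throwable => failure.set(e) }
      })
      t.start()
      t.join()
      // rethrow on the task thread so Spark records the task as failed
      if (failure.get() != null) throw failure.get()
    }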

Re: [MLlib] DIMSUM row similarity?

2015-08-31 Thread Reza Zadeh
This is ongoing work tracked by SPARK-4823 with a PR for it here: PR6213 - unfortunately the PR submitter didn't make it for Spark 1.5. On Mon, Aug 31, 2015 at 4:17 AM, Maandy wrote:

Checkpointing in Spark without Streaming

2015-08-31 Thread Ian Wood
I've been trying to track down a problem in Spark code relating to a task with many iterations. When trying to recreate an error with simpler code, I ran into a StackOverflowError due to large lineage. The solution is to add checkpoints, but the behavior of checkpoints is not well defined in the
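
A sketch of periodic checkpointing to truncate the lineage, assuming an iterative `step` function (names, path, and the interval are all illustrative):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // illustrative path

    var rdd = initial // assumed starting RDD
    for (i <- 1 to numIterations) {
      rdd = step(rdd).persist()
      if (i % 10 == 0) {
        rdd.checkpoint()
        rdd.count() // force an action so the checkpoint is materialized
      }
    }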

Parsing nested json objects with variable structure

2015-08-31 Thread SK
Hi, I need to parse a json input file where the nested objects take on a different structure based on the typeId field, as follows: { "d": { "uid": "12345", "contents": [{"info": {"eventId": "event1"}, "typeId": 19}] } } { "d": { "uid": "56780"
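
A sketch of one way to branch on typeId after loading, assuming the field paths follow the sample above (the file name is illustrative):

    import org.apache.spark.sql.functions.explode

    val df = sqlContext.read.json("events.json")
    val contents = df.select(df("d.uid").as("uid"), explode(df("d.contents")).as("c"))

    // pull out the fields that exist for typeId 19; other types get their own branch
    val type19 = contents.filter(contents("c.typeId") === 19)
      .select("uid", "c.info.eventId")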

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-31 Thread Sandy Ryza
Hi Timothy, For your first question, you would need to look in the logs and provide additional information about why your job is failing. The SparkContext shutting down could happen for a variety of reasons. In the situation where you give more memory, but less memory overhead, and the job

Window Sliding In spark

2015-08-31 Thread pankaj.wahane
Hi, I have an RDD of time series data coming from a Cassandra table. I want to create a sliding window on this RDD so that I get a new RDD with each element containing exactly six sequential elements from the RDD, in sorted order. Thanks in advance, Pankaj Wahane
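
A sketch using MLlib's sliding helper, assuming the rows carry a sortable timestamp (the Event type is illustrative):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    case class Event(timestamp: Long, value: Double) // illustrative row type

    val windows = rdd          // assumed RDD[Event] from the Cassandra table
      .sortBy(_.timestamp)     // global sort so windows are truly sequential
      .sliding(6)              // RDD[Array[Event]], six consecutive elements each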

Re: Parallel execution of RDDs

2015-08-31 Thread Brian Parker
Thank you for the comments. As you mentioned, increasing the thread pool allowed more parallel jobs, and decreasing the number of partitions allowed more RDDs to execute in parallel. Much appreciated On Aug 31, 2015 7:07 AM, "Igor Berman" wrote: > what is size of the pool

bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
Hi, I am working on a pipeline that carries out a number of stages, the last of which is to build some large JSON objects from information in the preceding stages. The JSON objects are then uploaded to Elasticsearch in bulk. If I carry out a shuffle via a `repartition` call after the JSON

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-08-31 Thread Davies Liu
I had sent out a PR [1] to fix 2), could you help to test that? [1] https://github.com/apache/spark/pull/8543 On Mon, Aug 31, 2015 at 12:34 PM, Anders Arpteg wrote: > Was trying out 1.5 rc2 and noticed some issues with the Tungsten shuffle > manager. One problem was when

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread canan chen
>> current I don't think there's a document address this part, also this rest api is only used for SparkSubmit currently, not public API as I know. Even if it is only used by SparkSubmit, I think documentation about how to use it there is still necessary. I am not sure why there's no document mentioning

Connection closed error while running Terasort

2015-08-31 Thread Suman Somasundar
Hi, I am getting the following error while trying to run a 10GB terasort under Yarn with 8 nodes. The command is: spark-submit --class com.github.ehiggs.spark.terasort.TeraSort --master yarn-cluster --num-executors 10 --executor-memory 32g

Group by specific key and save as parquet

2015-08-31 Thread gtinside
Hi, I have a set of data that I need to group by a specific key and then save as Parquet. Refer to the code snippet below. I am querying trade and then grouping by date val df = sqlContext.sql("SELECT * FROM trade") val dfSchema = df.schema val partitionKeyIndex =
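
A sketch of the same outcome with the DataFrameWriter (Spark 1.4+), letting Spark lay out one directory per key value (the output path is illustrative):

    val df = sqlContext.sql("SELECT * FROM trade")
    df.write.partitionBy("date").parquet("/data/trades_by_date")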

Re: [MLlib] DIMSUM row similarity?

2015-08-31 Thread Brian Parker
Not sure about the performance, but for now you could do: val mat = new IndexedRowMatrix(...) .toCoordinateMatrix() .transpose() .toRowMatrix() On Mon, Aug 31, 2015 at 1:31 PM, Reza Zadeh

[MLlib] DIMSUM row similarity?

2015-08-31 Thread Maandy
I've been trying to implement cosine similarity using Spark and stumbled upon this article https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html The only problem I have with it is that it seems that they assume that in my input file each *column* is a

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread Saisai Shao
Here is the REST-related part in Spark ( https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/deploy/rest ). Currently I don't think there's a document addressing this part; also, this REST API is only used by SparkSubmit currently, and is not a public API as far as I know. Thanks Jerry

Re: spark-submit issue

2015-08-31 Thread Pranay Tonpay
Thx for the inputs Igor. I am actually building an analytics layer ('as a service' model, using Spark as the backend engine) and hence I am implementing it this way... Initially, I was opening the SparkContext in the JVM that I had spawned (without even using spark-submit) and adding all

Data Security on Spark-on-HDFS

2015-08-31 Thread Daniel Schulz
Hi guys, In a nutshell: does Spark check and respect user privileges when reading/writing data? I am curious about the data security when Spark runs on top of HDFS, maybe through YARN. Is Spark running its long-running JVM processes as a Spark user, that makes no distinction when accessing

Re: spark-submit issue

2015-08-31 Thread Pranay Tonpay
Igor, this seems to be the cause; however, I am not sure at the moment how to resolve it... What I tried just now was that after the "SparkSubmitDriverBootstrapper" process reaches the hung stage... I went inside /proc//fd and just tailed "2" (stderr), and the process immediately exits.

Re: spark-submit issue

2015-08-31 Thread Igor Berman
1. Think once again about whether you want to call spark-submit in such a way... I'm not sure why you do it, but please consider just opening the Spark context inside your JVM (you need to add the Spark jars to the classpath). 2. Use https://commons.apache.org/proper/commons-exec/ with PumpStreamHandler On 31 August 2015

Re: spark streaming 1.3 kafka topic error

2015-08-31 Thread Shushant Arora
Say my cluster intermittently takes a long time to rebalance for some reason. To handle that, can I have infinite retries instead of killing the app? What should the value of retries be? Will (-1) work, or something else? On Thu, Aug 27, 2015 at 6:46 PM, Cody Koeninger

Re: spark-submit issue

2015-08-31 Thread Pranay Tonpay
Akhil, I am returning from the function (after closing the SparkContext) as I have to return some values to the calling function. Regardless, I tried System.exit too, no luck... From: Akhil Das Sent: Monday, August 31, 2015

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Sean Owen
Yeah I see that now. I think it fails immediately because the map operation does try to clean and/or verify the serialization of the closure upfront. I'm not quite sure what is going on, but I think it's some strange interaction between how you're building up the list and what the resulting

Re: Slow Mongo Read from Spark

2015-08-31 Thread Akhil Das
FYI, newAPIHadoopFile and newAPIHadoopRDD use the NewHadoopRDD class itself underneath, and it doesn't mean it will only read from HDFS. Give it a shot if you haven't tried it already (it's just the input format and the reader that are different from your approach). Thanks Best Regards On Mon, Aug

Write Concern used in Mongo-Hadoop Connector

2015-08-31 Thread Deepesh Maheshwari
Hi, I am using the below code to insert data into MongoDB from Spark. JavaPairRDD rdd; Configuration config = new Configuration(); config.set("mongo.output.uri", SparkProperties.MONGO_OUTPUT_URI); config.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat");
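
One hedged possibility, assuming the connector honors MongoDB connection-string options (the HADOOP-82 ticket linked later in this thread tracks write-concern support): express the write concern in the output URI itself (host and collection are illustrative):

    config.set("mongo.output.uri",
      "mongodb://localhost:27017/db.collection?w=majority") // w is the MongoDB write-concern option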

Re: Unable to build Spark 1.5, is build broken or can anyone successfully build?

2015-08-31 Thread Sean Owen
I don't think it's an 'issue'; the build does in fact require Maven 3.3.3 and Java 7 on purpose. If you don't have Maven 3.3.3, run 'build/mvn' or ('build/mvn --force' if you have an older Maven locally) to get 3.3.3. On Mon, Aug 31, 2015 at 6:27 AM, Kevin Jung wrote: > I

Re: Slow Mongo Read from Spark

2015-08-31 Thread Deepesh Maheshwari
Hi Akhil, This code snippet is from the link below: https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java Here it reads data from the HDFS file system, but in our case I need to read from MongoDB. I tried it earlier and have now tried it

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread canan chen
I mean the Spark built-in REST API. On Mon, Aug 31, 2015 at 3:09 PM, Akhil Das wrote: > Check Spark Jobserver > > > Thanks > Best Regards > > On Mon, Aug 31, 2015 at 8:54 AM, canan chen wrote: > >>

Re: spark-submit issue

2015-08-31 Thread Pranay Tonpay
Akhil, I am not sure if that would let me catch errors, if any? My requirement is that I have to keep track of any error that happens and notify the user of it... I have not used JobServer, but will take a look... Igor, I changed log4j in Spark, and rather than the default System.err,

Re: Data Security on Spark-on-HDFS

2015-08-31 Thread Steve Loughran
> On 31 Aug 2015, at 11:02, Daniel Schulz wrote: > > Hi guys, > > In a nutshell: does Spark check and respect user privileges when > reading/writing data. Yes, in a locked down YARN cluster —until your tokens expire > > I am curious about the data security

Re: Spark shell and StackOverFlowError

2015-08-31 Thread Sean Owen
It's not clear; that error is different still and somehow suggests you're serializing a stream somewhere. I'd look at what's inside bcItemsIdx as that is not shown here. On Mon, Aug 31, 2015 at 3:34 PM, Ashish Shrowty wrote: > Sean, > > Thanks for your comments. What I

Re: Parallel execution of RDDs

2015-08-31 Thread Igor Berman
What is the size of the pool you are submitting Spark jobs from (the futures you've mentioned)? Is it 8? I think you have a fixed thread pool of 8, so there can't be more than 8 parallel jobs running... so try to increase it. What is the number of partitions of each of your RDDs? How many cores does your worker

Parallel execution of RDDs

2015-08-31 Thread Brian Parker
Hi, I have a large number of RDDs that I need to process separately. Instead of submitting these jobs to the Spark scheduler one by one, I'd like to submit them in parallel in order to maximize cluster utilization. I've tried to process the RDDs as Futures, but the number of Active jobs maxes out
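
A sketch of submitting jobs from a larger dedicated pool, assuming `rdds: Seq[RDD[_]]` already exists (the pool size is illustrative):

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // a pool bigger than the default so more jobs are in flight at once
    implicit val ec = ExecutionContext.fromExecutorService(
      Executors.newFixedThreadPool(32))

    val jobs = rdds.map(rdd => Future { rdd.count() })
    // block for completion if needed:
    // Await.result(Future.sequence(jobs), Duration.Inf)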

Re: Write Concern used in Mongo-Hadoop Connector

2015-08-31 Thread Ted Yu
Take a look at the following: https://github.com/mongodb/mongo-hadoop/blob/master/core/src/main/java/com/mongodb/hadoop/MongoOutputFormat.java https://jira.mongodb.org/plugins/servlet/mobile#issue/HADOOP-82 > On Aug 31, 2015, at 4:39 AM, Deepesh Maheshwari