Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-16 Thread Akhil Das
If you use fileStream, there's an option to filter out files. In your case you can easily create a filter to remove _temporary files. In that case, you will have to move your code inside foreachRDD of the dstream since the application will become a streaming app. Thanks Best Regards On Sat, Mar
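
A minimal sketch of the filter Akhil describes (the S3 path, batch interval and app name are placeholders, not from the thread):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("filtered-stream"), Seconds(60))

// Skip anything still sitting under Hadoop's _temporary staging directories.
def notTemporary(path: Path): Boolean = !path.toString.contains("_temporary")

val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("s3n://my-bucket/input", notTemporary _, newFilesOnly = true)
  .map(_._2.toString)

lines.foreachRDD { rdd =>
  // the existing batch logic moves in here once the app becomes a streaming job
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```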

Re: org.apache.spark.SparkException Error sending message

2015-03-16 Thread Akhil Das
Not sure if this will help, but can you try setting the following: set("spark.core.connection.ack.wait.timeout","6000") Thanks Best Regards On Sat, Mar 14, 2015 at 4:08 AM, Chen Song wrote: > When I ran Spark SQL query (a simple group by query) via hive support, I > have seen lots of failures

Re: how to print RDD by key into file with grouByKey

2015-03-16 Thread Akhil Das
If you want more partitions then you have to specify it as: Rdd.groupByKey(*10*).mapValues... I think if you don't specify anything, the # partitions will be the # cores that you have for processing. Thanks Best Regards On Sat, Mar 14, 2015 at 12:28 AM, Adrian Mocanu wrote: > Hi > > I have a
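
A hedged illustration of that call shape (the partition count and output path are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("group-by-key-demo"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// 10 is an arbitrary partition count; omit it and groupByKey uses the default parallelism.
val grouped = pairs.groupByKey(10).mapValues(_.mkString(","))

// One part-file per partition, each line holding "key<TAB>values".
grouped.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///tmp/grouped-output")
```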

Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Guys, We have a project which builds upon Spark streaming. We use Kafka as the input stream, and create 5 receivers. When this application runs for around 90 hours, all the 5 receivers failed for some unknown reasons. In my understanding, it is not guaranteed that Spark streaming receiver will d

Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank

2015-03-16 Thread hseagle
Hi all, I'm wondering whether the latest spark-1.3.0 supports the windowing and analytic functions in hive, such as row_number, rank, etc. Indeed, I've done some testing by using spark-shell and found that row_number is not supported yet. But I still found that there were s

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Akhil Das
You need to figure out why the receivers failed in the first place. Look in your worker logs and see what really happened. When you run a streaming job continuously for longer period mostly there'll be a lot of logs (you can enable log rotation etc.) and if you are doing a groupBy, join, etc type o

Re: Running Scala Word Count Using Maven

2015-03-16 Thread Su She
Hello, So I actually solved the problem... see point 3. Here are a few approaches/errors I was getting: 1) mvn package exec:java -Dexec.mainClass=HelloWorld Error: java.lang.ClassNotFoundException: HelloWorld 2) http://stackoverflow.com/questions/26929100/running-a-scala-application-in-maven-pro

How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi, I have set spark.executor.memory to 2048m, and in the UI "Environment" page, I can see this value has been set correctly. But in the "Executors" page, I saw there's only 1 executor and its memory is 265.4MB. Very strange value. why not 256MB, or just as what I set? What am I missing here? T

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Dibyendu Bhattacharya
Which version of Spark you are running ? You can try this Low Level Consumer : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer This is designed to recover from various failures and have very good fault recovery mechanism built in. This is being used by many users and at present we

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
It doesn’t take effect if just putting jar files under the lib-managed/jars folder, you need to put that under class path explicitly. From: sandeep vura [mailto:sandeepv...@gmail.com] Sent: Monday, March 16, 2015 2:21 PM To: Cheng, Hao Cc: fightf...@163.com; Ted Yu; user Subject: Re: Re: Unable t
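
One hedged way to put such jars on the class path explicitly (the jar names and paths are placeholders; point them at wherever the metastore/JDBC jars actually live):

```
# conf/spark-defaults.conf
spark.driver.extraClassPath    /path/to/mysql-connector-java.jar:/path/to/datanucleus-core.jar
spark.executor.extraClassPath  /path/to/mysql-connector-java.jar:/path/to/datanucleus-core.jar
```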

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread sandeep vura
which location should i need to specify the classpath exactly . Thanks, On Mon, Mar 16, 2015 at 12:52 PM, Cheng, Hao wrote: > It doesn’t take effect if just putting jar files under the > lib-managed/jars folder, you need to put that under class path explicitly. > > > > *From:* sandeep vura [

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
How are you setting it? and how are you submitting the job? Thanks Best Regards On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen wrote: > Hi, > > I have set spark.executor.memory to 2048m, and in the UI "Environment" > page, I can see this value has been set correctly. But in the "Executors" > page, I

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Akhil, I have checked the logs. There isn't any clue as to why the 5 receivers failed. That's why I just take it for granted that it will be a common issue for receiver failures, and we need to figure out a way to detect this kind of failure and do fail-over. Thanks On Mon, Mar 16, 2015 at 3:1

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Dibyendu, Thanks for the reply. I am reading your project homepage now. One quick question I care about is: If the receivers failed for some reasons(for example, killed brutally by someone else), is there any mechanism for the receiver to fail over automatically? On Mon, Mar 16, 2015 at 3:25 P

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Akhil Das
There should be something in the worker/driver logs regarding the failure. For receiver failures, you can try the lowlevel kafka consumer as Dibyendu suggested, You need to have a high-availability setup with Monitoring enabled (nagios etc config

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Akhil Das
As i seen, once i kill my receiver on one machine, it will automatically spawn another receiver on another machine or on the same machine. Thanks Best Regards On Mon, Mar 16, 2015 at 1:08 PM, Jun Yang wrote: > Dibyendu, > > Thanks for the reply. > > I am reading your project homepage now. > > O

start-slave.sh failed with ssh port other than 22

2015-03-16 Thread ZhuGe
Hi all: I am new to spark and i want to set up a cluster of 3 nodes (standalone mode). I can start the master and see the web ui. Because the ssh port of the 3 nodes is configured to 58518, when i use sbin/start-slave.sh, the log message shows ssh: connect to host node1 port 22: connection refuse

Processing of text file in large gzip archive

2015-03-16 Thread sergunok
I have a 30GB gzip file (originally that is a text file where each line represents a text document) in HDFS and Spark 1.2.0 under a YARN cluster with 3 worker nodes with 64GB RAM and 4 cores on each node. Replication factor for my file is 3. I tried to implement a simple pyspark script to parse this file

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
I have checked Dibyendu's code, it looks that his implementation has auto-restart mechanism: src/main/java/consumer/kafka

unable to access spark @ spark://debian:7077

2015-03-16 Thread Ralph Bergmann
Hi, I try my first steps with Spark but I have problems to access Spark running on my Linux server from my Mac. I start Spark with sbin/start-all.sh When I now open the website at port 8080 I see that all is running and I can access Spark at port 7077 but this doesn't work. I scanned the Linux

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Dibyendu Bhattacharya
Yes.. Auto restart is enabled in my low level consumer ..when there is some unhandled exception comes... Even if you see KafkaConsumer.java, for some cases ( like broker failure, kafka leader changes etc ) it can even refresh the Consumer (The Coordinator which talks to a Leader) which will recove

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set it in code, not by configuration. I submit my jar file to local. I am working in my developer environment. On Mon, 16 Mar 2015 18:28 Akhil Das wrote: > How are you setting it? and how are you submitting the job? > > Thanks > Best Regards > > On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen wrote

Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
I used "local[*]". The CPU hits about 80% when there are active jobs, then it drops to about 13% and hand for a very long time. Thanks, David On Mon, 16 Mar 2015 17:46 Akhil Das wrote: > How many threads are you allocating while creating the sparkContext? like > local[4] will allocate 4 threads

MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, Current implementation of map function in spark streaming looks as below. def map[U: ClassTag](mapFunc: T => U): DStream[U] = { new MappedDStream(this, context.sparkContext.clean(mapFunc)) } It creates an instance of MappedDStream which is a subclass of DStream. The same function can

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
By default spark.executor.memory is set to 512m. I'm assuming, since you are submitting the job using spark-submit, that it is not able to override the value since you are running in local mode. Can you try it without using spark-submit, as a standalone project? Thanks Best Regards On Mon, Mar 16, 201

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Akhil Das
Try setting SPARK_MASTER_IP and you need to use the Spark URI (spark://yourlinuxhost:7077) as displayed in the top left corner of Spark UI (running on port 8080). Also when you are connecting from your mac, make sure your network/firewall isn't blocking any port between the two machines. Thanks Be

RE: MappedStream vs Transform API

2015-03-16 Thread Shao, Saisai
I think these two ways are both OK for you to write streaming job, `transform` is a more general way for you to transform from one DStream to another if there’s no related DStream API (but have related RDD API). But using map maybe more straightforward and easy to understand. Thanks Jerry From
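
A trivial sketch of the two equivalent formulations being compared (the length mapping is arbitrary):

```scala
import org.apache.spark.streaming.dstream.DStream

def viaMap(stream: DStream[String]): DStream[Int] =
  stream.map(_.length)

// transform exposes the underlying RDD of each batch, so any RDD API is available.
def viaTransform(stream: DStream[String]): DStream[Int] =
  stream.transform(rdd => rdd.map(_.length))
```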

Re: start-slave.sh failed with ssh port other than 22

2015-03-16 Thread Akhil Das
Open sbin/slaves.sh and sbin/spark-daemon.sh and then look for ssh command, pass the port argument to that command in your case *-p 58518* and save those files, do a start-all.sh :) Thanks Best Regards On Mon, Mar 16, 2015 at 1:37 PM, ZhuGe wrote: > Hi all: > I am new to spark and i want to set
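
If the scripts in your release already read SPARK_SSH_OPTS (worth grepping sbin/slaves.sh to confirm before relying on it), an alternative to editing them is to export the option once:

```
# conf/spark-env.sh
export SPARK_SSH_OPTS="-p 58518"
```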

Re: Processing of text file in large gzip archive

2015-03-16 Thread Akhil Das
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoop file for this. 2. Instead of map, do a mapPartitions 3. You need to open the driver UI and see what's really taking time. If that is running on a remote machine and you are not able to access

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi Akhil, Yes, you are right. If I ran the program from IDE as a normal java program, the executor's memory is increased...but not to 2048m, it is set to 6.7GB...Looks like there's some formula to calculate this value. Thanks, David On Mon, Mar 16, 2015 at 7:36 PM Akhil Das wrote: > By defau

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
How much memory are you having on your machine? I think default value is 0.6 of the spark.executor.memory as you can see from here . Thanks Best Regards On Mon, Mar 16, 2015 at 2:26 PM, Xi Shen wrote: > Hi Akhil, > > Yes,

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set "spark.executor.memory" to "2048m". If the executor storage memory is 0.6 of executor memory, it should be 2g * 0.6 = 1.2g. My machine has 56GB memory, and 0.6 of that should be 33.6G...I hate math xD On Mon, Mar 16, 2015 at 7:59 PM Akhil Das wrote: > How much memory are you having on yo

[no subject]

2015-03-16 Thread Hector
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
Strange, even i'm having it while running in local mode. [image: Inline image 1] I set it as .set("spark.executor.memory", "1g") Thanks Best Regards On Mon, Mar 16, 2015 at 2:43 PM, Xi Shen wrote: > I set "spark.executor.memory" to "2048m". If the executor storage memory > is 0.6 of executor

Re: Which OutputCommitter to use for S3?

2015-03-16 Thread Pei-Lun Lee
Hi, I created a JIRA and PR for supporting a s3 friendly output committer for saveAsParquetFile: https://issues.apache.org/jira/browse/SPARK-6352 https://github.com/apache/spark/pull/5042 My approach is add a DirectParquetOutputCommitter class in spark-sql package and use a boolean config variabl

Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Hi, We're facing "No space left on device" errors lately from time to time. The job will fail after retries. Obviously in such a case, retry won't be helpful. Sure it's a problem in the datanodes, but I'm wondering if the Spark Driver can handle it and decommission the problematic datanode before retryi

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang wrote: > Hi, > > We're facing "No space left on device" errors lately from time to time. > The job will fail after retries. Obvious in such case, retry won't be > helpful. > > Sure

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Shixiong Zhu
There are 2 cases for "No space left on device": 1. Some tasks which use large temp space cannot run in any node. 2. The free space of datanodes is not balanced. Some tasks which use large temp space cannot run in several nodes, but they can run in other nodes successfully. Because most of our ca

Re: How to pass parameters to a spark-jobserver Scala class?

2015-03-16 Thread Sasi
Sorry for the long silence. We are able to 1. Pass parameters from Vaadin (Java Framework) to spark-jobserver using HttpURLConnection POST method. 2. Receive filtered (based on passed parameters) RDD results from spark-jobserver using HttpURLConnection GET method. 3. Finally, showing the results o

Re: MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, Thanks for the response. I understand that part. But I am asking why the internal implementation using a subclass when it can use an existing api? Unless there is a real difference, it feels like code smell to me. Regards, Madhukara Phatak http://datamantra.io/ On Mon, Mar 16, 2015 at 2:14

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread sandeep vura
Hi, Please find the attached is my spark configuration files. Regards, Sandeep.v On Mon, Mar 16, 2015 at 12:58 PM, sandeep vura wrote: > which location should i need to specify the classpath exactly . > > Thanks, > > > On Mon, Mar 16, 2015 at 12:52 PM, Cheng, Hao wrote: > >> It doesn’t take

Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
> 1. I don't think textFile is capable of unpacking a .gz file. You need to use > hadoopFile or newAPIHadoop file for this. Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is compute splits on gz files, so if you have a single file, you'll have a single partition. P
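
A hedged sketch of that workflow; parseDocument stands in for whatever per-line parsing the original script does, and the paths and multiplier are placeholders:

```scala
val lines = sc.textFile("hdfs:///data/file.gz")   // one partition: gzip is not splittable
  .repartition(sc.defaultParallelism * 3)         // spread the lines across the cluster
val parsed = lines.map(parseDocument)             // parseDocument: your own per-line parser
parsed.saveAsTextFile("hdfs:///data/parsed")
```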

Re: Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank

2015-03-16 Thread Arush Kharbanda
You can track the issue here. https://issues.apache.org/jira/browse/SPARK-1442 Its currently not supported, i guess the test cases are work in progress. On Mon, Mar 16, 2015 at 12:44 PM, hseagle wrote: > Hi all, > > I'm wondering whether the latest spark-1.3.0 supports the windowing > an

Parquet and repartition

2015-03-16 Thread Masf
Hi all. When I specify the number of partitions and save this RDD in parquet format, my app fails. For example selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput") However, it works well if I store data in text selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput") M

Error when using multiple python files spark-submit

2015-03-16 Thread poiuytrez
I have a spark app which is composed of multiple files. When I launch Spark using: ../hadoop/spark-install/bin/spark-submit main.py --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077 I am getting an error: 15/0
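
For context (not stated in the thread): spark-submit treats everything after the primary application file as arguments to the application itself, so in the command above the --py-files and --master flags are being handed to main.py rather than to Spark, which is a common cause of such failures. A reordered invocation with the same paths would look like:

```
../hadoop/spark-install/bin/spark-submit \
  --master spark://spark-m:7077 \
  --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py \
  main.py
```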

Re: How to set Spark executor memory?

2015-03-16 Thread Sean Owen
There are a number of small misunderstandings here. In the first instance, the executor memory is not actually being set to 2g and the default of 512m is being used. If you are writing code to launch an app, then you are trying to duplicate what spark-submit does, and you don't use spark-submit. I
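
A hedged summary of the distinction in command form (master URLs, class name and jar are placeholders):

```
# Cluster/standalone mode: executors are separate JVMs, so --executor-memory (spark.executor.memory) applies.
spark-submit --master spark://master:7077 --executor-memory 2g --class com.example.MyApp myapp.jar

# Local mode: the "executor" lives inside the driver JVM, so grow the driver heap instead.
spark-submit --master "local[*]" --driver-memory 2g --class com.example.MyApp myapp.jar
```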

Re: Parquet and repartition

2015-03-16 Thread Masf
Thanks Sean, I forgot it. The output error is the following: java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359) at org.apache.spar

Re: k-means hang without error/warning

2015-03-16 Thread Sean Owen
I think you'd have to say more about "stopped working". Is the GC thrashing? does the UI respond? is the CPU busy or not? On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen wrote: > Hi, > > I am running k-means using Spark in local mode. My data set is about 30k > records, and I set the k = 1000. > > The a

Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
Hi Sean, My system is windows 64 bit. I looked into the resource manager, Java is the only process that used about 13% CPU resource; no disk activity related to Java; only about 6GB memory used out of 56GB in total. My system responds very well. I don't think it is a system issue. Thanks, David

Can I start multiple executors in local mode?

2015-03-16 Thread Xi Shen
Hi, In YARN mode you can specify the number of executors. I wonder if we can also start multiple executors at local, just to make the test run faster. Thanks, David

Re: jar conflict with Spark default packaging

2015-03-16 Thread Shawn Zheng
Thanks a lot. I will give a try! On Monday, March 16, 2015, Adam Lewandowski wrote: > Prior to 1.3.0, Spark has 'spark.files.userClassPathFirst' for non-yarn > apps. For 1.3.0, use 'spark.executor.userClassPathFirst'. > > See > https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3C

Re: Can I start multiple executors in local mode?

2015-03-16 Thread xu Peng
Hi David, You can try the local-cluster. The numbers in local-cluster[2,2,1024] mean 2 workers, 2 cores per worker and 1024M memory per worker. Best Regards Peng Xu 2015-03-16 19:46 GMT+08:00 Xi Shen : > Hi, > > In YARN mode you can specify the number of executors. I wonder if we can > also start multip
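
A minimal sketch, assuming a built Spark distribution is available locally (local-cluster is used mainly by Spark's own tests, so treat it as an experiment rather than a supported mode):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[workers, coresPerWorker, memoryPerWorkerMB]
val conf = new SparkConf()
  .setAppName("local-cluster-test")
  .setMaster("local-cluster[2,2,1024]")
val sc = new SparkContext(conf)
```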

Re: configure number of cached partition in memory on SparkSQL

2015-03-16 Thread Cheng Lian
Hi Judy, In the case of HadoopRDD and NewHadoopRDD, partition number is actually decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize is not related to partition number, it controls the in-memory columnar batch size within a single partition. Also, what d

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Sean Owen
Are you sure the master / slaves started? Do you have network connectivity between the two? Do you have multiple interfaces maybe? Does debian resolve correctly and as you expect to the right host/interface? On Mon, Mar 16, 2015 at 8:14 AM, Ralph Bergmann wrote: > Hi, > > > I try my first steps w

Iterative Algorithms with Spark Streaming

2015-03-16 Thread Alex Minnaar
I wanted to ask a basic question about the types of algorithms that are possible to apply to a DStream with Spark streaming. With Spark it is possible to perform iterative computations on RDDs like in the gradient descent example val points = spark.textFile(...).map(parsePoint).cache() v

Re: Spark 1.3 createDataframe error with pandas df

2015-03-16 Thread kevindahl
kevindahl wrote > I'm trying to create a spark data frame from a pandas data frame, but for > even the most trivial of datasets I get an error along the lines of this: > > --- > Py4JJavaError Traceb

RE: How to set Spark executor memory?

2015-03-16 Thread jishnu.prathap
Hi Xi Shen, You could set the spark.executor.memory in the code itself: new SparkConf().set("spark.executor.memory", "2g") Or you can try the -- spark.executor.memory 2g while submitting the jar. Regards Jishnu Prathap From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, March 16

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Ralph Bergmann
I can access the manage webpage at port 8080 from my mac and it told me that master and 1 slave is running and I can access them at port 7077 But the port scanner shows that port 8080 is open but not port 7077. I started the port scanner on the same machine where Spark is running. Ralph Am 16.

insert hive partitioned table

2015-03-16 Thread patcharee
Hi, I tried to insert into a hive partitioned table val ZONE: Int = Integer.valueOf(args(2)) val MONTH: Int = Integer.valueOf(args(3)) val YEAR: Int = Integer.valueOf(args(4)) val weightedUVToDF = weightedUVToRecord.toDF() weightedUVToDF.registerTempTable("speeddata") hiveContext.sql("INSERT OV

HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Bharath Ravi Kumar
Hi, Trying to run spark ( 1.2.1 built for hdp 2.2) against a yarn cluster results in the AM failing to start with following error on stderr: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher An application id was assigned to the job, but there were no logs. Not

Re: Iterative Algorithms with Spark Streaming

2015-03-16 Thread Nick Pentreath
MLlib supports streaming linear models: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression and k-means: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means With an iteration parameter of 1, this amounts to mini-batch SGD where the mini-batch is
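
A hedged sketch based on the linked docs (ssc is an existing StreamingContext; the feature count and input directories are placeholders):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val numFeatures = 3
val trainingData = ssc.textFileStream("hdfs:///training").map(LabeledPoint.parse)
val testData     = ssc.textFileStream("hdfs:///testing").map(LabeledPoint.parse)

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setNumIterations(1)   // one SGD pass per batch, i.e. mini-batch gradient descent

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
```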

Re: insert hive partitioned table

2015-03-16 Thread patcharee
I would like to insert the table, and the value of the partition column to be inserted must be from temporary registered table/dataframe. Patcharee On 16. mars 2015 15:26, Cheng Lian wrote: Not quite sure whether I understand your question properly. But if you just want to read the partitio

Re: insert hive partitioned table

2015-03-16 Thread Cheng Lian
Not quite sure whether I understand your question properly. But if you just want to read the partition columns, it's pretty easy. Take the "year" column as an example, you may do this in HiveQL: hiveContext.sql("SELECT year FROM speed") or in DataFrame DSL: hiveContext.table("speed").sele

Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows: lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3) For more details on why, see this answer . Nick ​ On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier wrote: > 1. I don't think textFi

Priority queue in spark

2015-03-16 Thread abhi
Hi, Currently all the jobs in spark get submitted using a queue. I have a requirement where a submitted job will generate another set of jobs with some priority, which should again be submitted to the spark cluster based on priority. Means the job with higher priority should be executed first. Is it feasib
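
One hedged way to approximate this with what Spark already offers is the fair scheduler: define weighted pools and tag each generated job with a pool before triggering its actions (the pool names, allocation file path and the two RDDs below are made up). Pools give relative weights rather than strict preemption, so a higher-priority job still shares the cluster rather than displacing running tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("priority-jobs")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // pools and weights defined here
val sc = new SparkContext(conf)

sc.setLocalProperty("spark.scheduler.pool", "high-priority")
highPriorityRdd.count()   // scheduled in the heavily weighted pool

sc.setLocalProperty("spark.scheduler.pool", "low-priority")
lowPriorityRdd.count()    // scheduled in the lightly weighted pool
```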

Re: insert hive partitioned table

2015-03-16 Thread Cheng Lian
I see. Since all Spark SQL queries must be issued from the driver side, you'll have to first collect all interested values to the driver side, and then use them to compose one or more insert statements. Cheng On 3/16/15 10:33 PM, patcharee wrote: I would like to insert the table, and the value
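
A hedged sketch of that pattern, reusing the table names visible in the thread (speeddata, speed) and the partition columns (zone, month, year); the selected columns u and v are placeholders:

```scala
val partitions = hiveContext
  .sql("SELECT DISTINCT zone, month, year FROM speeddata")
  .collect()

partitions.foreach { row =>
  val (zone, month, year) = (row.getInt(0), row.getInt(1), row.getInt(2))
  hiveContext.sql(
    s"""INSERT OVERWRITE TABLE speed PARTITION (zone=$zone, month=$month, year=$year)
       |SELECT u, v FROM speeddata
       |WHERE zone=$zone AND month=$month AND year=$year""".stripMargin)
}
```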

Re: How to preserve/preset partition information when load time series data?

2015-03-16 Thread Imran Rashid
Hi Shuai, It should certainly be possible to do it that way, but I would recommend against it. If you look at HadoopRDD, its doing all sorts of little book-keeping that you would most likely want to mimic. eg., tracking the number of bytes & records that are read, setting up all the hadoop confi

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Todd Nist
Hi Bharath, I ran into the same issue a few days ago, here is a link to a post on Horton's forum. http://hortonworks.com/community/forums/search/spark+1.2.1/ In case anyone else needs to perform this, these are the steps I took to get it to work with Spark 1.2.1 as well as Spark 1.3.0-RC3: 1. Pul

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
Hi Shuai, On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng wrote: > Sorry I response late. > > Zhan Zhang's solution is very interesting and I look at into it, but it is > not what I want. Basically I want to run the job sequentially and also gain > parallelism. So if possible, if I have 1000 parti

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Bharath Ravi Kumar
Hi Todd, Thanks for the help. I'll try again after building a distribution with the 1.3 sources. However, I wanted to confirm what I mentioned earlier: is it sufficient to copy the distribution only to the client host from where spark-submit is invoked(with spark.yarn.jar set), or is there a need

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check spark.scheduler.executorTaskBlacklistTime. Jianshi On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu wrote: > There are 2 cases for "No space left on device": > > 1. Some tasks which use larg

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang wrote: > Thanks Shixiong! > > Very strange that our tasks were retried on the same executor again and > again. I'll check spark.scheduler.execut
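
For reference, the setting is just a SparkConf entry. The property is undocumented in the 1.x line, and the value below is an arbitrary illustration (the number in the message is truncated in the archive):

```scala
val conf = new SparkConf()
  .set("spark.scheduler.executorTaskBlacklistTime", "30000")  // ms a failed task avoids that executor
```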

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Ralph Bergmann
Okay I think I found the mistake. The Eclipse Maven plug-in suggested version 1.2.1 of the spark-core lib but I use Spark 1.3.0. As I fixed it I can access the Spark server. Ralph On 16.03.15 at 14:39, Ralph Bergmann wrote: > I can access the manage webpage at port 8080 from my mac and it told

ClassNotFoundException

2015-03-16 Thread Ralph Bergmann
Hi, I want to try the JavaSparkPi example[1] on a remote Spark server but I get a ClassNotFoundException. When I run it local it works but not remote. I added the spark-core lib as dependency. Do I need more? Any ideas? Thanks Ralph [1] ... https://github.com/apache/spark/blob/master/exampl
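
A common cause of this when driving a remote master from an IDE is that the application classes never reach the executors. A hedged sketch of one fix (the jar path is a placeholder): package the app and register the jar on the SparkConf so Spark ships it to the workers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("JavaSparkPi")
  .setMaster("spark://debian:7077")
  .setJars(Seq("target/my-spark-app-0.1.jar"))  // placeholder: the jar built from your project
val sc = new SparkContext(conf)
```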

RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread jaykatukuri
Hi all, I am trying to use the new ALS implementation under org.apache.spark.ml.recommendation.ALS. The new method to invoke for training seems to be override def fit(dataset: DataFrame, paramMap: ParamMap): ALSModel. How do I create a dataframe object from ratings data set that is on hdfs ?

Re: Problem connecting to HBase

2015-03-16 Thread HARIPRIYA AYYALASOMAYAJULA
Hello Ted, Yes, I can understand what you are suggesting. But I am unable to decipher where I am going wrong, could you please point out what are the locations to be looked at to be able to find and correct the mistake? I greatly appreciate your help! On Sun, Mar 15, 2015 at 1:10 PM, Ted Yu wr

Re: Parquet and repartition

2015-03-16 Thread Cheng Lian
Hey Masf, I’ve created SPARK-6360 to track this issue. Detailed analysis is provided there. The TL;DR is, for Spark 1.1 and 1.2, if a SchemaRDD contains decimal or UDT column(s), after applying any traditional RDD transformations (e.g. repart

[SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Shuai Zheng
Hi All, I am running Spark 1.2.1 and AWS SDK. To make sure AWS compatible on the httpclient 4.2 (which I assume spark use?), I have already downgrade to the version 1.9.0 But even that, I still got an error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.http.impl.co

Re: [SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Ted Yu
From my local maven repo: $ jar tvf ~/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar | grep SchemeRegistry 1373 Fri Apr 19 18:19:36 PDT 2013 org/apache/http/impl/conn/SchemeRegistryFactory.class 2954 Fri Apr 19 18:19:36 PDT 2013 org/apache/http/conn/scheme/Sche

Any IRC channel on Spark?

2015-03-16 Thread Feng Lin
Hi, everyone, I'm wondering whether there is a possibility to setup an official IRC channel on freenode. I noticed that a lot of apache projects would have a such channel to let people talk directly. Best Michael

Basic GraphX deployment and usage question

2015-03-16 Thread Khaled Ammar
Hi, I'm very new to Spark and GraphX. I downloaded and configured Spark on a cluster, which uses Hadoop 1.x. The master UI shows all workers. The example command "run-example SparkPi" works fine and completes successfully. I'm interested in GraphX. Although the documentation says it is built-in w

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-16 Thread Eason Hu
Hi Akhil, Yes, I did change both versions on the project and the cluster. Any clues? Even the sample code from Spark website failed to work. Thanks, Eason On Sun, Mar 15, 2015 at 11:56 PM, Akhil Das wrote: > Did you change both the versions? The one in your build file of your > project and t

Creating a hive table on top of a parquet file written out by spark

2015-03-16 Thread kpeng1
Hi All, I wrote out a complex parquet file from spark sql and now I am trying to put a hive table on top. I am running into issues with creating the hive table itself. Here is the json that I wrote out to parquet using spark sql: {"user_id":"4513","providers":[{"id":"4220","name":"dbmvl","behavi

Re: Top rows per group

2015-03-16 Thread Xiangrui Meng
https://issues.apache.org/jira/browse/SPARK-5954 is for this issue and Shuo is working on it. We will first implement topByKey for RDD and then we could add it to DataFrames. -Xiangrui On Mon, Mar 9, 2015 at 9:43 PM, Moss wrote: > I do have a schemaRDD where I want to group by a given field F1,

Re: MappedStream vs Transform API

2015-03-16 Thread Tathagata Das
It's mostly for legacy reasons. First we had added all the MappedDStream, etc. and then later we realized we need to expose something that is more generic for arbitrary RDD-RDD transformations. It can be easily replaced. However, there is a slight value in having MappedDStream, for developers to le

Re: Spark Streaming with compressed xml files

2015-03-16 Thread Vijay Innamuri
textFileStream and default fileStream recognizes the compressed xml(.xml.gz) files. Each line in the xml file is an element in RDD[string]. Then whole RDD is converted to a proper xml format data and stored in a *Scala variable*. - I believe storing huge data in a *Scala variable* is ineffici

Re: Scaling problem in RandomForest?

2015-03-16 Thread Xiangrui Meng
Try increasing the driver memory. We store trees on the driver node. If maxDepth=20 and numTrees=50, you may need a large driver memory to store all tree models. You might want to start with a smaller maxDepth and then increase it and see whether deep trees really help (vs. the cost). -Xiangrui On

Re: Logistic Regression displays ERRORs

2015-03-16 Thread Xiangrui Meng
Actually, they should be INFO or DEBUG. Line search steps are expected. You can configure log4j.properties to ignore those. A better solution would be reporting this at https://github.com/scalanlp/breeze/issues -Xiangrui On Thu, Mar 12, 2015 at 5:46 PM, cjwang wrote: > I am running LogisticRegres
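
A hedged example of the log4j change, assuming the noisy lines come from breeze's optimizers (adjust the logger name to whatever package actually appears in the log output):

```
# conf/log4j.properties
log4j.logger.breeze.optimize=WARN
```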

problems with spark-streaming-kinesis-asl and "sbt assembly" ("different file contents found")

2015-03-16 Thread Kelly, Jonathan
I'm attempting to use the Spark Kinesis Connector, so I've added the following dependency in my build.sbt: libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" My app works fine with "sbt run", but I can't seem to get "sbt assembly" to work without failing with

Re: Any way to find out feature importance in Spark SVM?

2015-03-16 Thread Xiangrui Meng
You can compute the standard deviations of the training data using Statistics.colStats and then compare them with model coefficients to compute feature importance. -Xiangrui On Fri, Mar 13, 2015 at 11:35 AM, Natalia Connolly wrote: > Hello, > > While running an SVMClassifier in spark, is ther
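
A hedged sketch of that recipe; `training` is assumed to be the RDD[LabeledPoint] the SVM was fit on and `model` the resulting SVMModel:

```scala
import org.apache.spark.mllib.stat.Statistics

val stats = Statistics.colStats(training.map(_.features))
val stdDevs = stats.variance.toArray.map(math.sqrt)

// Scale each weight by its feature's standard deviation so weights on
// wide-ranging features are comparable to weights on narrow ones.
val importance = model.weights.toArray.zip(stdDevs).map { case (w, sd) => math.abs(w) * sd }

importance.zipWithIndex.sortBy(-_._1).take(10).foreach {
  case (score, idx) => println(s"feature $idx -> $score")
}
```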

Re: RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread Xiangrui Meng
Try this: val ratings = purchase.map { line => line.split(',') match { case Array(user, item, rate) => (user.toInt, item.toInt, rate.toFloat) } }.toDF("user", "item", "rate") Doc for DataFrames: http://spark.apache.org/docs/latest/sql-programming-guide.html -Xiangrui On Mon, Mar 16, 2015 at 9

RE: [SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Shuai Zheng
And it is a NoSuchMethodError, not a ClassNotFound error. And by default I think the spark is only compiled against Hadoop 2.2? For this issue itself, I just checked the latest spark (1.3.0), its version can work (because it packages a newer version of httpclient, I can see the method is th

sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Shuai Zheng
Hi All, I just upgrade the system to use version 1.3.0, but then the sqlContext.parquetFile doesn't work with s3n. I have test the same code with 1.2.1 and it works. A simple test running in spark-shell: val parquetFile = sqlContext.parquetFile("""s3n:///test/2.parq """) java.lang.

partitionBy not working w HashPartitioner

2015-03-16 Thread Adrian Mocanu
Here's my use case: I read an array into an RDD and I use a hash partitioner to partition the RDD. This is the array type: Array[(String, Iterable[(Long, Int)])] topK:Array[(String, Iterable[(Long, Int)])] = ... import org.apache.spark.HashPartitioner val hashPartitioner=new HashPartitioner(10) v
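
A minimal sketch of what partitionBy expects; note that it only applies to pair RDDs and returns a new RDD, so the result has to be captured (the original keeps its old partitioning):

```scala
import org.apache.spark.HashPartitioner

val topK: Array[(String, Iterable[(Long, Int)])] = ???  // as in the message
val rdd = sc.parallelize(topK)
val partitioned = rdd.partitionBy(new HashPartitioner(10))

println(partitioned.partitioner)        // Some(org.apache.spark.HashPartitioner@...)
println(partitioned.partitions.length)  // 10
```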

Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Kelly, Jonathan
See https://issues.apache.org/jira/browse/SPARK-6351 ~ Jonathan From: Shuai Zheng <szheng.c...@gmail.com> Date: Monday, March 16, 2015 at 11:46 AM To: "user@spark.apache.org" <user@spark.apache.org> Subject: sqlContext.parquetFile doesn't work with s3n

RE: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Shuai Zheng
I see, but this is really a… big issue. Anyway for me to work around? I tried to set fs.default.name = s3n, but looks like it doesn't work. I must upgrade to 1.3.0 because I face the package incompatible issue in 1.2.1, and if I must patch something, I'd rather go with the latest version. Rega

Re: Spark Streaming with compressed xml files

2015-03-16 Thread Tathagata Das
That's why XMLInputFormat suggested by Akhil is a good idea. It should give you full XML object as on record, (as opposed to an XML record spread across multiple line records in textFileStream). Then you could convert each record into a json, thereby making it a json RDD. Then you can save it as a

Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Michael Armbrust
We will be including this fix in Spark 1.3.1 which we hope to make in the next week or so. On Mon, Mar 16, 2015 at 12:01 PM, Shuai Zheng wrote: > I see, but this is really a… big issue. anyway for me to work around? I > try to set the fs.default.name = s3n, but looks like it doesn’t work. > > >

Re: Basic GraphX deployment and usage question

2015-03-16 Thread Takeshi Yamamuro
Hi, You're right, that is, graphx has already been included in a spark default package. As a first step, 'Analytics' seems to be suitable for your objective. # ./bin/run-example graphx.Analytics pagerank On Tue, Mar 17, 2015 at 2:21 AM, Khaled Ammar wrote: > Hi, > > I'm very new to Spark an

Re: problems with spark-streaming-kinesis-asl and "sbt assembly" ("different file contents found")

2015-03-16 Thread Tathagata Das
If you are creating an assembly, make sure spark-streaming is marked as provided. spark-streaming is already part of the spark installation so will be present at run time. That might solve some of these, may be!? TD On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan wrote: > I'm attempting to u
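
A hedged build.sbt sketch of that suggestion (versions match the one quoted in the thread; the kinesis artifact stays unscoped because it is not part of the Spark installation):

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
)
```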
