Adding more slaves to a running cluster

2015-11-24 Thread Dillian Murphey
What's the current status on adding slaves to a running cluster? I want to leverage spark-ec2 and autoscaling groups. I want to launch slaves as spot instances when I need to do some heavy lifting, but I don't want to bring down my cluster in order to add nodes. Can this be done by just running

Re: RDD partition after calling mapToPair

2015-11-24 Thread trung kien
Thanks Cody for the very useful information. It's much clearer to me now; I had a lot of wrong assumptions. On Nov 23, 2015 10:19 PM, "Cody Koeninger" wrote: > Partitioner is an optional field when defining an rdd. KafkaRDD doesn't > define one, so you can't really assume

Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi, You should be able to register an org.apache.spark.streaming.scheduler.StreamingListener. There is an example here that may help: https://gist.github.com/akhld/b10dc491aad1a2007183 and the Spark API docs here,

Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi, Sorry, that was the wrong link; it should have been the StreamingListener, http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/scheduler/StreamingListener.html The BatchInfo can be obtained from the event, for example: public void
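
For reference, a minimal Scala sketch of what registering such a listener might look like, assuming an existing StreamingContext named ssc (the listener and BatchInfo fields follow the linked API docs):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchStarted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
        val info = batchStarted.batchInfo
        // batchTime identifies the active batch; numRecords is its input size
        println(s"Active batch: ${info.batchTime}, records: ${info.numRecords}")
      }
    })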

java.io.IOException when using KryoSerializer

2015-11-24 Thread Piero Cinquegrana
Hello, I am using Spark 1.4.1 with Zeppelin. When using the Kryo serializer (spark.serializer = org.apache.spark.serializer.KryoSerializer) instead of the default Java serializer, I am getting the following error. Is this a known issue? Thanks, Piero java.io.IOException: Failed to connect to
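
For context, enabling Kryo is typically just a SparkConf setting; a minimal sketch (the app name is an assumed placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-example") // assumed name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")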

Does receiver based approach lose any data in case of a leader/broker loss in Spark Streaming?

2015-11-24 Thread SRK
Hi, Does the receiver-based approach lose any data in case of a leader/broker loss in Spark Streaming? We currently use Kafka Direct for Spark Streaming, and it seems to be failing when there is a leader loss, and we can't really guarantee that there won't be any leader loss due to rebalancing. If

Re: Does receiver based approach lose any data in case of a leader/broker loss in Spark Streaming?

2015-11-24 Thread Cody Koeninger
The direct stream shouldn't silently lose data in the case of a leader loss. Loss of a leader is handled like any other failure, retrying up to spark.task.maxFailures times. But really, if you're losing leaders and taking that long to rebalance, you should figure out what's wrong with your
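
For reference, the retry count Cody mentions is a standard Spark setting; a minimal sketch with an illustrative value:

    import org.apache.spark.SparkConf

    // Raise task retries from the default of 4; the value 8 is illustrative.
    val conf = new SparkConf().set("spark.task.maxFailures", "8")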

queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
Hi, As a beginner, I have the below queries on Spork (Pig on Spark). I have cloned git clone https://github.com/apache/pig -b spark . 1. On which version of Pig and Spark is Spork being built? 2. I followed the steps mentioned in https://issues.apache.org/jira/browse/PIG-4059 and tried to

Re: queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
Log file contents: Pig Stack Trace --- ERROR 2998: Unhandled internal error. Could not initialize class org.apache.spark.rdd.RDDOperationScope$ java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$ at

Why does a 3.8 TB dataset take up 11.59 TB on HDFS

2015-11-24 Thread AlexG
I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster with 16.73 TB of storage, using distcp. The dataset is a collection of tar files of about 1.7 TB each. Nothing else was stored in HDFS, but after completing the download, the namenode page says that 11.59 TB are in use.

Re: Why does a 3.8 TB dataset take up 11.59 TB on HDFS

2015-11-24 Thread Koert Kuipers
What is your HDFS replication set to? On Wed, Nov 25, 2015 at 1:31 AM, AlexG wrote: > I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 > cluster > with 16.73 TB of storage, using > distcp. The dataset is a collection of tar files of about 1.7 TB each. >

Re: Why does a 3.8 TB dataset take up 11.59 TB on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG: Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 = 11.4 TB. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote: > I downloaded a 3.8 TB dataset from S3 to a freshly launched
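
For anyone wanting to confirm this, a minimal sketch that reads the replication factor of a file through the Hadoop FileSystem API (the path is hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // getReplication reports the replication factor recorded for the file
    val rep = fs.getFileStatus(new Path("/data/part-00000.tar")).getReplication
    println(s"Replication factor: $rep")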

Conversely, Hive is performing better than Spark-Sql

2015-11-24 Thread UMESH CHAUDHARY
Hi, I am using Hive 1.1.0 and Spark 1.5.1 and creating a Hive context in spark-shell. I am now seeing Hive outperform Spark SQL: by default, Hive returns results in 27 seconds for a plain select * query on a 1 GB dataset containing 3,623,203 records, while spark-sql gives back

RE: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-24 Thread Shuai Zheng
Hi All, Just an update on this case. I tried many different combinations of settings (and I just upgraded to the latest EMR 4.2.0 with Spark 1.5.2). I found out that the problem comes from: spark-submit --deploy-mode client --executor-cores=24 --driver-memory=5G

Spark sql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append

2015-11-24 Thread Siva Gudavalli
Ref: https://issues.apache.org/jira/browse/SPARK-11953 In Spark 1.3.1 we have two methods, createJDBCTable and insertIntoJDBC. They are replaced with write.jdbc() in Spark 1.4.1. createJDBCTable allows one to perform CREATE TABLE (DDL) on the table followed by INSERT (DML); insertIntoJDBC

Re: Spark Streaming idempotent writes to HDFS

2015-11-24 Thread Michael
So basically, writing them into a temporary directory named with the batch time and then moving the files to their destination on success? I wish there was a way to skip moving files around and be able to set the output filenames. Thanks Burak :) -Michael On Mon, Nov 23, 2015, at 09:19 PM,
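
A minimal sketch of that stage-then-rename idea, assuming dstream is an existing DStream of strings and hypothetical output paths; the rename publishes each batch exactly once:

    import org.apache.hadoop.fs.{FileSystem, Path}

    dstream.foreachRDD { (rdd, time) =>
      val tmp = new Path(s"/output/_tmp-${time.milliseconds}")   // staging dir
      val dst = new Path(s"/output/batch-${time.milliseconds}")  // final dir
      val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
      if (!fs.exists(dst)) {        // a re-run of an already-published batch is a no-op
        rdd.saveAsTextFile(tmp.toString)
        fs.rename(tmp, dst)         // move to the destination only on success
      }
    }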

Spark 1.4.2- java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-24 Thread Sahil Sareen
I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help. This is the exception that I am getting: [MySparkApplication] WARN: Failed to execute SQL statement select * from TableS s join TableC c on s.property = c.property from X YZ org.apache.spark.SparkException: Job

Re: Conversely, Hive is performing better than Spark-Sql

2015-11-24 Thread Sabarish Sasidharan
First of all, select * is not a useful SQL query to evaluate. Very rarely would a user require all 3.6M records for visual analysis. Second, collect() forces movement of all data from the executors to the driver; instead, write it out to some other table or to HDFS. Also, Spark is more beneficial when you
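
A minimal sketch of the "write it out" alternative, assuming a hypothetical table name and output path:

    // Instead of sqlContext.sql("select * from my_table").collect(), which
    // pulls everything to the driver, write the result back out:
    val df = sqlContext.sql("SELECT * FROM my_table")
    df.write.parquet("hdfs:///tmp/query_output")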

Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-24 Thread jmvllt
Hi guys, This may be a stupid question, but I'm facing an issue here. I found the class BinaryClassificationMetrics and I wanted to compute the aucROC or aucPR of my model. The thing is that the predict method of a LogisticRegressionModel only returns the predicted class, and not the

Re: Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi Prem, Thank you for the details. I'm not able to build; I'm facing some issues. Is there any repository link where I can download the (preview version of the) 1.6 spark-core_2.11 and spark-sql_2.11 jar files? Regards, Rajesh On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure

Re: Spark 1.6 Build

2015-11-24 Thread Ted Yu
See: http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Prem, > > Thank you for the details. I'm not able to build. I'm facing some issues. > > Any

Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-24 Thread Sean Owen
Your reasoning is correct; you need probabilities (or at least some score) out of the model, not just a 0/1 label, in order for a ROC/PR curve to have meaning. But you just need to call clearThreshold() on the model to make it return a probability. On Tue, Nov 24, 2015 at 5:19 PM, jmvllt
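
A minimal sketch of Sean's suggestion, assuming a trained mllib LogisticRegressionModel named model and a held-out RDD[LabeledPoint] named test:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    model.clearThreshold() // predict() now returns probabilities, not 0/1 labels
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"AUC-ROC = ${metrics.areaUnderROC()}, AUC-PR = ${metrics.areaUnderPR()}")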

Re: Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi Ted, I'm not able to find the spark-core_2.11 and spark-sql_2.11 jar files in the above link. Regards, Rajesh On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu wrote: > See: > > http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview > > On

Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
thx for mentioning the build requirement. But actually it is -Dscala-2.11 (i.e. -D for a Java property instead of -P for a profile). Details: we can see this in the pom.xml, where the scala-2.11 profile is activated by the scala-2.11 property and sets scala.version to 2.11.7 and scala.binary.version to 2.11. So the scala-2.11

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
Hi Sabarish, Thanks for the suggestion; I did not know about wholeTextFiles(). By the way, your suggestion about repartitioning was spot on! My run time for count() went from an elapsed time of 0:56:42.902407 to 0:00:03.215143 on a data set of about 34M of 4720 records. Andy From:

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
Hi Don, I went to a presentation given by Professor Ion Stoica. He mentioned that Python was a little slower in general because of the type system. I do not remember all of his comments; I think the context had to do with Spark SQL and data frames. I wonder if the Python issue is similar to the

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
Is it possible that the Kafka offset API is somehow returning the wrong offsets? Each time, the job fails for different partitions with an error similar to the one below. Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times, most recent failure: Lost task

Re: Spark Kafka Direct Error

2015-11-24 Thread Cody Koeninger
Anything's possible, but that sounds pretty unlikely to me. Are the partitions it's failing for all on the same leader? Have there been any leader rebalances? Do you have enough log retention? If you log the offset for each message as it's processed, when do you see the problem? On Tue, Nov 24,

Re: spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-24 Thread Jeff Schecter
There is no code path in the script /root/spark-ec2/spark/init.sh that can actually get to the version of Spark 1.5.2 pre-built with Hadoop 2.6. I think the 2.4 version includes Hive as well... but setting the Hadoop major version to 2 won't actually get you there. Sigh. The documentation is the

Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
Hi Madabhattula, Scala 2.11 requires building from source; prebuilt binaries are available only for Scala 2.10. From the src folder, run dev/change-scala-version.sh 2.11, then build as you would normally with either mvn or sbt. The above info *is* included in the Spark docs but a little hard

Re: Spark 1.6 Build

2015-11-24 Thread Ted Yu
See also: https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/ w.r.t. building locally, please specify -Pscala-2.11 Cheers On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch wrote: > HI Madabhattula >

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
I see the assertion error when I compare the offset ranges as shown below. How do I log the offset for each message?

    kafkaStream.transform { rdd =>
      // Get the offset ranges in the RDD
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      for (o <-
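
One way to answer the "log the offset for each message" question, sketched from the direct-stream documentation pattern (kafkaStream is the stream from the snippet above):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    kafkaStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val o = offsetRanges(TaskContext.get.partitionId)
        // every message in this partition falls in [o.fromOffset, o.untilOffset)
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        iter.foreach(msg => println(s"processing $msg within the range above"))
      }
    }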

Re: Spark SQL Save CSV with JSON Column

2015-11-24 Thread Davies Liu
I think you could have a Python UDF to turn the properties into a JSON string:

    import simplejson

    def to_json(row):
        return simplejson.dumps(row.asDict(recursive=True))

    to_json_udf = pyspark.sql.functions.udf(to_json)
    df.select("col_1", "col_2",

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread 谢廷稳
OK, yarn.scheduler.maximum-allocation-mb is 16384. I have run it again; the command to run it is: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200 > > > 15/11/24 16:15:56 INFO

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread Sabarish Sasidharan
If YARN has only 50 cores, then it can support at most 49 executors plus 1 driver application master. Regards Sab On 24-Nov-2015 1:58 pm, "谢廷稳" wrote: > OK, yarn.scheduler.maximum-allocation-mb is 16384. > > I have run it again; the command to run it is: > ./bin/spark-submit

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread Saisai Shao
Did you set the configuration "spark.dynamicAllocation.initialExecutors"? You can set spark.dynamicAllocation.initialExecutors to 50 and try again. I guess you might be hitting this issue since you're running 1.5.0: https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot explain
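
For reference, a minimal sketch of the settings under discussion, with illustrative values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.initialExecutors", "50")
      .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN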

Getting ParquetDecodingException when I am running my spark application from spark-submit

2015-11-24 Thread Kapil Raaj
The relevant error lines are: Caused by: parquet.io.ParquetDecodingException: Can't read value in column [roll_key] BINARY at value 19600 out of 4814, 19600 out of 19600 in currentPage. repetition level: 0, definition level: 1 Caused by: org.apache.spark.SparkException: Job aborted due to stage

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread 谢廷稳
@Sab Thank you for your reply, but the cluster has 6 nodes which contain 300 cores, and the Spark application did not request resources from YARN. @Saisai I ran it successfully with "spark.dynamicAllocation.initialExecutors" set to 50, but in

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread Saisai Shao
The document is right. Because of a bug introduced in https://issues.apache.org/jira/browse/SPARK-9092, this configuration fails to work. It is fixed in https://issues.apache.org/jira/browse/SPARK-10790; you could change to a newer version of Spark. On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sean Owen
Not sure who generally handles that, but I just made the edit. On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote: > Sorry to be a nag, I realize folks with edit rights on the Powered by Spark > page are very busy people, but it's been 10 days since my original request,

[streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-11-24 Thread ponkin
Hi, When I create a stream with KafkaUtils.createDirectStream, I can explicitly set the position ("largest" or "smallest") from which to read the topic. What if I have previous checkpoints (in HDFS, for example) with offsets, and I want to start reading from the last checkpoint? In the source code of

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread 谢廷稳
Thank you very much; after changing to a newer version, it works well! 2015-11-24 17:15 GMT+08:00 Saisai Shao : > The document is right. Because of a bug introduced in > https://issues.apache.org/jira/browse/SPARK-9092 which makes this > configuration fail to work. > > It

Re: Spark Expand Cluster

2015-11-24 Thread Dinesh Ranganathan
Thanks Christopher, I will try that. Dan On 20 November 2015 at 21:41, Bozeman, Christopher wrote: > Dan, > > > > Even though you may be adding more nodes to the cluster, the Spark > application has to be requesting additional executors in order to use > the added

indexedrdd and radix tree: how to search indexedRDD using all prefixes?

2015-11-24 Thread Mina
Hello, I have a question about the radix tree (PART) implementation in Spark's IndexedRDD. I explored the source code and found out that the radix tree used in IndexedRDD only returns exact matches. However, this seems to be a restricted use; for example, I want to find children nodes using a prefix

Re: indexedrdd and radix tree: how to search indexedRDD using all prefixes?

2015-11-24 Thread Mina
This is what a Radix tree returns.

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Reynold Xin
I just updated the page to say "email dev" instead of "email user". On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen wrote: > Not sure who generally handles that, but I just made the edit. > > On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote: > > Sorry to

Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi, I'm not able to build Spark 1.6 from source. Could you please share the steps to build Spark 1.6? Regards, Rajesh

Re: DateTime Support - Hive Parquet

2015-11-24 Thread Cheng Lian
I see, then this is actually irrelevant to Parquet. I guess we can support Joda DateTime in Spark SQL's reflective schema inference to have this, provided that this is a frequent use case, and Spark SQL already has Joda as a direct dependency. On the other hand, if you are using Scala, you can

RE: DateTime Support - Hive Parquet

2015-11-24 Thread Bryan
Cheng, I am using Scala. I have an implicit conversion from Joda DateTime to Timestamp. My tables are defined with Timestamp. However, explicit conversion appears to be required. Do you have an example of an implicit conversion for this case? Do you convert on insert or on RDD-to-DF conversion?
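
One possible shape for that conversion, sketched with a hypothetical case class; whether the implicit is picked up automatically depends on where it is in scope when the DataFrame is created:

    import java.sql.Timestamp
    import org.joda.time.DateTime
    import scala.language.implicitConversions

    case class Event(name: String, at: Timestamp) // hypothetical schema

    implicit def jodaToTimestamp(dt: DateTime): Timestamp =
      new Timestamp(dt.getMillis)

    // hypothetical usage: the implicit converts r.createdAt (a DateTime)
    // e.g. rdd.map(r => Event(r.name, r.createdAt: Timestamp)).toDF()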

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-11-24 Thread Deng Ching-Mallete
Hi, If you wish to read from checkpoints, you need to use StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to create the streaming context that you pass in to KafkaUtils.createDirectStream(...). You may refer to
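
A minimal sketch of that pattern for the direct stream; the checkpoint path, broker, and topic are assumed values:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/checkpoints" // assumed path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("direct-stream-from-checkpoint")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events")) // assumed topic
      stream.map(_._2).print() // define all processing before returning
      ssc
    }

    // On restart this restores the offsets (and the whole DAG) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()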

RE: DateTime Support - Hive Parquet

2015-11-24 Thread Bryan
Cheng, That's exactly what I was hoping for: native support for writing DateTime objects. As it stands, Spark 1.5.2 seems to leave no option but to do manual conversion (to nanos, Timestamp, etc.) prior to writing records to Hive. Regards, Bryan Jeffrey Sent from Outlook Mail From: Cheng

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-11-24 Thread Понькин Алексей
Great, thank you. Sorry for being so inattentive :) I need to read the docs carefully. -- Yandex.Mail (reliable mail) http://mail.yandex.ru/neo2/collect/?exp=1=1 24.11.2015, 15:15, "Deng Ching-Mallete" : > Hi, > > If you wish to read from checkpoints, you need to use >

Re: Spark 1.6 Build

2015-11-24 Thread Prem Sure
You can refer to: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I'm not able to build Spark 1.6 from source. Could you please

Re: Experiences about NoSQL databases with Spark

2015-11-24 Thread Ted Yu
You should consider using HBase as the NoSQL database. Regarding 'The data in the DB should be indexed', you need to design the HBase schema carefully so that retrieval is fast. Disclaimer: I work on HBase. On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345 wrote: >

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sujit Pal
Thank you Sean, much appreciated. And yes, perhaps "email dev" is a better option since the traffic is (probably) lighter and these sorts of requests are more likely to get noticed. Although one would need to subscribe to the dev list to do that... -sujit On Tue, Nov 24, 2015 at 1:16 AM, Sean