Re: what is algorithm to optimize function with nonlinear constraints

2015-12-01 Thread Zhiliang Zhu
Thanks a lot for Ushnish's kind reply. I am considering applying a simulated annealing algorithm to this question. However, there may be just one issue - sensitivity - that is to say, for exactly the same system of function / constraints, the solution may differ between runs of the program

Re: spark rdd grouping

2015-12-01 Thread Rajat Kumar
What if I don't have to use an aggregate function, only a groupByKeyLocally() and then a map transformation? Will reduceByKeyLocally help here? Or is there any workaround if groupByKey is not local but global across all partitions? Thanks On Tue, Dec 1, 2015 at 5:20 PM, ayan guha wrote: > I bel

Re: Low Latency SQL query

2015-12-01 Thread ayan guha
You can try query push down by creating the query while creating the rdd. On 2 Dec 2015 12:32, "Fengdong Yu" wrote: > It depends on many situations: > > 1) what’s your data format? csv(text) or ORC/parquet? > 2) Did you have Data warehouse to summary/cluster your data? > > > if your data is tex
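A minimal sketch of that kind of push-down, assuming a JDBC source; the URL, driver and query below are placeholders:

    // Push the aggregation down to the database by giving a subquery as the JDBC "dbtable".
    val pushedDown = sqlContext.read
      .format("jdbc")
      .options(Map(
        "url" -> "jdbc:postgresql://dbhost:5432/sales",
        "driver" -> "org.postgresql.Driver",
        "dbtable" -> "(SELECT product, SUM(cost) AS total FROM orders GROUP BY product) agg"))
      .load()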

Re: Can Spark Execute Hive Update/Delete operations

2015-12-01 Thread 张炜
Hello Ted and all, We are using Hive 1.2.1 and Spark 1.5.1 I also noticed that there are other users reporting this problem. http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-spark-on-hive-td25372.html#a25486 Thanks a lot for help! Regards, Sai On Wed, Dec 2, 2015 at 11:11 AM Ted Yu

Re: Spark on Mesos with Centos 6.6 NFS

2015-12-01 Thread Akhil Das
Can you try mounting the NFS directory on all machines on the same location? (say /mnt/nfs) and try it again? Thanks Best Regards On Thu, Nov 26, 2015 at 1:22 PM, leonidas wrote: > Hello, > I have a setup with spark 1.5.1 on top of Mesos with one master and 4 > slaves. I am submitting a Spark j

Graphx: How to print the group of connected components one by one

2015-12-01 Thread Zhang, Jingyu
Can anyone please let me know how to print all nodes in connected components one by one? graph.connectedComponents() e.g. component ID -> node IDs: 1 -> 1,2,3; 6 -> 6,7,8,9 Thanks -- This message
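A small sketch of one way to do this, assuming `graph` is an existing GraphX Graph; connectedComponents() tags each vertex with the smallest vertex id of its component:

    val cc = graph.connectedComponents().vertices          // RDD of (nodeId, componentId)
    cc.map { case (nodeId, compId) => (compId, nodeId) }
      .groupByKey()
      .collect()
      .foreach { case (compId, nodes) => println(s"$compId: ${nodes.mkString(",")}") }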

Re: General question on using StringIndexer in SparkML

2015-12-01 Thread Vishnu Viswanath
Hi Jeff, I went through the link you provided and I could understand how the fit() and transform() work. I tried to use the pipeline in my code and I am getting exception Caused by: org.apache.spark.SparkException: Unseen label: The reason for this error as per my understanding is: For the colum
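A common workaround sketch for the "Unseen label" error: fit the StringIndexer on the full set of labels (train plus test) so nothing is unseen at transform time. `trainDF`, `testDF` and the "category" column are assumptions, not from the thread:

    import org.apache.spark.ml.feature.StringIndexer

    val indexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(trainDF.select("category").unionAll(testDF.select("category")))
    val trainIndexed = indexerModel.transform(trainDF)
    val testIndexed  = indexerModel.transform(testDF)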

Retrieving the PCA parameters in pyspark

2015-12-01 Thread Rohit Girdhar
Hi I'm using PCA through the python interface for spark, as per the instructions on this page: https://spark.apache.org/docs/1.5.1/ml-features.html#pca It works fine for learning the parameters and transforming the data. However, I'm unable to find a way to retrieve the learnt PCA parameters. I t
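For reference, the Scala ML API exposes the fitted components on the model; the Python API of that era may not, which is the poster's problem. A sketch assuming a DataFrame `df` with a "features" vector column:

    import org.apache.spark.ml.feature.PCA

    val pcaModel = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)
      .fit(df)
    println(pcaModel.pc)   // principal components matrix, one column per component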

Increasing memory usage on batch job (pyspark)

2015-12-01 Thread Aaron Jackson
Greetings, I am processing a "batch" of files and have structured an iterative process around them. Each batch is processed by first loading the data with spark-csv, performing some minor transformations and then writing back out as parquet. Absolutely no caching or shuffle should occur with anyt

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Thanks Marcelo, But I have a single server(JVM) that is creating SparkContext, are you saying Spark supports multiple SparkContext in the same JVM? Could you please clarify on this? Thanks Anfernee On Tue, Dec 1, 2015 at 8:14 PM, Marcelo Vanzin wrote: > On Tue, Dec 1, 2015 at 3:32 PM, Anferne

Spark Streaming - History UI

2015-12-01 Thread patcharee
Hi, On my history server UI, I cannot see "streaming" tab for any streaming jobs? I am using version 1.5.1. Any ideas? Thanks, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-m

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Dibyendu Bhattacharya
Hi, if you use the Receiver-based consumer which is available in spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) , this has all built-in failure recovery and it can recover from any Kafka leader changes and offset out of range issues. Here is the package from github :

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Marcelo Vanzin
On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu wrote: > I have a long running backend server where I will create a short-lived Spark > job in response to each user request, base on the fact that by default > multiple Spark Context cannot be created in the same JVM, looks like I have > 2 choices > > 2

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi, you can try: if your table is under location "/test/table/" on HDFS and has partitions: "/test/table/dt=2012" "/test/table/dt=2013" df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table") > On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote: > > df.write.partitionBy("date").i
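A cleaned-up sketch of the suggestion; note the partition column name must match the directory names ("dt" here is an assumption based on the example paths):

    import org.apache.spark.sql.SaveMode

    // Appending creates/extends partition directories such as /test/table/dt=2014.
    df.write
      .mode(SaveMode.Append)
      .partitionBy("dt")
      .save("/test/table")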

Re: Spark Expand Cluster

2015-12-01 Thread Alexander Pivovarov
Try to run spark-shell with the correct number of executors, e.g. for a 10-box cluster running on r3.2xlarge (61 GB RAM, 8 cores) you can use the following: spark-shell \ --num-executors 20 \ --driver-memory 2g \ --executor-memory 24g \ --executor-cores 4 you might also want to set spark.y

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Jeff Zhang
I don't think there's api for that, but think it is reasonable and helpful for ETL. As a workaround you can first register your dataframe as temp table, and use sql to insert to the static partition. On Wed, Dec 2, 2015 at 10:50 AM, Isabelle Phan wrote: > Hello, > > Is there any API to insert d
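A sketch of the suggested workaround; `staging`, `target_table` and the column names are assumptions, and the INSERT statement needs a HiveContext:

    // Register the DataFrame and insert into a specific static partition with HiveQL.
    df.registerTempTable("staging")
    sqlContext.sql(
      """INSERT INTO TABLE target_table PARTITION (date='2015-12-01')
        |SELECT col_a, col_b FROM staging""".stripMargin)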

Re: Can Spark Execute Hive Update/Delete operations

2015-12-01 Thread Ted Yu
Can you tell us the version of Spark and hive you use ? Thanks On Tue, Dec 1, 2015 at 7:08 PM, 张炜 wrote: > Dear all, > We have a requirement that needs to update delete records in hive. These > operations are available in hive now. > > But when using hiveContext in Spark, it always pops up an "

Can Spark Execute Hive Update/Delete operations

2015-12-01 Thread 张炜
Dear all, We have a requirement that needs to update/delete records in hive. These operations are available in hive now. But when using hiveContext in Spark, it always pops up a "not supported" error. Is there any way to support update/delete operations using spark? Regards, Sai

SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Isabelle Phan
Hello, Is there any API to insert data into a single partition of a table? Let's say I have a table with 2 columns (col_a, col_b) and a partition by date. After doing some computation for a specific date, I have a DataFrame with 2 columns (col_a, col_b) which I would like to insert into a specifi

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Josh Rosen
Yep, you shouldn't enable *spark.driver.allowMultipleContexts* since it has the potential to cause extremely difficult-to-debug task failures; it was originally introduced as an escape-hatch to allow users whose workloads happened to work "by accident" to continue using multiple active contexts, bu

Re: New to Spark

2015-12-01 Thread Ted Yu
Have you tried the following command ? REFRESH TABLE Cheers On Tue, Dec 1, 2015 at 1:54 AM, Ashok Kumar wrote: > Hi, > > I am new to Spark. > > I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables. > > I have successfully made Hive metastore to be used by Spark. > > In spa

Re: New to Spark

2015-12-01 Thread fightf...@163.com
Hi there, which Spark version are you using? You made the Hive metastore usable by Spark, which means you can run SQL queries over the current Hive tables, right? Or do you just use the local Hive metastore embedded on the Spark SQL side? I think you need to provide more info for your spark sql and h

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Ted Yu
Looks like #2 is better choice. On Tue, Dec 1, 2015 at 4:51 PM, Anfernee Xu wrote: > Thanks Ted, so 1) is off from the table, can I go with 2), yarn-cluster > mode? As the driver is running as a Yarn container, it's should be OK for > my usercase, isn't it? > > Anfernee > > On Tue, Dec 1, 2015 a

Re: Low Latency SQL query

2015-12-01 Thread Fengdong Yu
It depends on many situations: 1) what's your data format? csv(text) or ORC/parquet? 2) Do you have a data warehouse to summarize/cluster your data? If your data is text or you query the raw data, it should be slow; Spark cannot do much to optimize your job. > On Dec 2, 2015, at 9:21 AM,

Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Mark, We have an application that uses data from different kinds of sources, and we built an engine able to handle that, but it can't scale with big data (we could, but it is too time-expensive) and doesn't have a machine learning module, etc. We came across Spark and it looks like it has all we need, act

how to use spark.mesos.constraints

2015-12-01 Thread rarediel
I am trying to add Mesos constraints to my spark-submit command in my Marathon file; I am setting spark.mesos.coarse=true. Here is an example of a constraint I am trying to set: --conf spark.mesos.constraint=cpus:2 I want to use the constraints to control the number of executors that are creat

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Thanks Ted, so 1) is off the table, can I go with 2), yarn-cluster mode? As the driver is running as a Yarn container, it should be OK for my use case, isn't it? Anfernee On Tue, Dec 1, 2015 at 4:48 PM, Ted Yu wrote: > For #1, looks like the config is used in test suites: > > .se

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Ted Yu
For #1, looks like the config is used in test suites: .set("spark.driver.allowMultipleContexts", "true") ./sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala .set("spark.driver.allowMultipleContexts", "true") ./sql/core/src/test/scala/org/apache/spark/sql/exec

Re: Low Latency SQL query

2015-12-01 Thread Xiao Li
http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext Try to read this article. It might help you understand your problem. Thanks, Xiao Li 2015-12-01 16:36 GMT-08:00 Mark Hamstra : > I'd ask another question first: If your SQL que

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
I'd ask another question first: If your SQL query can be executed in a performant fashion against a conventional (RDBMS?) database, why are you trying to use Spark? How you answer that question will be the key to deciding among the engineering design tradeoffs to effectively use Spark or some othe

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
Right, you can't expect a completely cold first query to execute faster than the data can be retrieved from the underlying datastore. After that, lowest latency query performance is largely a matter of caching -- for which Spark provides at least partial solutions. On Tue, Dec 1, 2015 at 4:27 PM,

Re: Low Latency SQL query

2015-12-01 Thread Michal Klos
You should consider presto for this use case. If you want fast "first query" times it is a better fit. I think sparksql will catch up at some point but if you are not doing multiple queries against data cached in RDDs and need low latency it may not be a good fit. M > On Dec 1, 2015, at 7:23

Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Ok, so the latency problem is caused by my using SQL as the source? How about csv, hive, or another source? On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra wrote: > It is not designed for interactive queries. > > > You might want to ask the designers of Spark, Spark SQL, and particularly > s

Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Thanks Jörn, I didn't expect Spark to be faster than SQL, just not that slow. We are tempted to use Spark as our hub of sources; that way we can access data through different data sources and normalize it. Currently we are saving the data in SQL because of Spark's latency, but the best would be to execute di

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
> > It is not designed for interactive queries. You might want to ask the designers of Spark, Spark SQL, and particularly some things built on top of Spark (such as BlinkDB) about their intent with regard to interactive queries. Interactive queries are not the only designed use of Spark, but it

Master is listing DEAD slaves, can they be cleaned up?

2015-12-01 Thread Dillian Murphey
On the status page on port 8080 I see the slaves that have been stopped. They are marked as DEAD. If I implement an autoscaling system on aws, which is the plan, then I can see there being hundreds of slaves going up and down. Is there a way to force a cleanup of dead slaves so they are not piling u

Re: Low Latency SQL query

2015-12-01 Thread Jörn Franke
Hmm it will never be faster than SQL if you use SQL as an underlying storage. Spark is (currently) an in-memory batch engine for iterative machine learning workloads. It is not designed for interactive queries. Currently hive is going into the direction of interactive queries. Alternatives are

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
How to avoid those errors with the receiver-based approach? Suppose we are OK with at-least-once processing and use the receiver-based approach which uses ZooKeeper but does not query Kafka directly, would these errors (Couldn't find leader offsets for Set([test_stream,5])) be avoided? On Tue, Dec 1, 2015 a

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Cody Koeninger
KafkaRDD.scala , handleFetchErr On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy wrote: > Hi Cody, > > How to look at Option 2(see the following)? Which portion of the code in > Spark Kafka Direct to look at to handle this issue specific to our > requirements. > > > 2.Catch that exception and so

Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Hi, I have a doubt regarding yarn-cluster mode and spark.driver. allowMultipleContexts for below usercases. I have a long running backend server where I will create a short-lived Spark job in response to each user request, base on the fact that by default multiple Spark Context cannot be created

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Following is the Option 2 that I was talking about: 2.Catch that exception and somehow force things to "reset" for that partition And how would it handle the offsets already calculated in the backlog (if there is one)? On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy wrote: > Hi Cody, > > How t

Re: Send JsonDocument to Couchbase

2015-12-01 Thread Eyal Sharon
anyone ? I know that there isn't much experience yet with Couchbase connector On Tue, Dec 1, 2015 at 4:12 PM, Eyal Sharon wrote: > Hi , > > I am still having problems with Couchbase connector . Consider the > following code fragment which aims to create a Json document to send to > Couch > > > >

Spark DIMSUM Memory requirement?

2015-12-01 Thread Parin Choganwala
Hi All, I am trying to run RowMatrix.similarity(0.5) on 60K users (n) with 130k features (m) on spark 1.3.0. Using 4 m3.2xlarge 30GB RAM and 8 cores but getting lots of ERROR YarnScheduler: Lost executor 1 on XXX.internal: remote Akka client disassociate What could be the reason? Is it shuffle

Driver Hangs before starting Job

2015-12-01 Thread Patrick Brown
Hi, I am building a Spark app which aggregates sensor data stored in Cassandra. After I submit my app to spark the driver and application show up quickly then, before any Spark job shows up in the application UI there is a huge lag, on the order of minutes to sometimes hours. Once the Spark job i

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Iulian Dragoș
As I mentioned on the akka mailing list, in case others are following this thread: the issue isn't with dependencies. It's a bug in the maven-shade-plugin. It breaks classfiles when creating the assembly jar (it seems to do some constant propagation). `sbt assembly` doesn't suffer from this issue

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Hi Cody, How to look at Option 2(see the following)? Which portion of the code in Spark Kafka Direct to look at to handle this issue specific to our requirements. 2.Catch that exception and somehow force things to "reset" for that partition And how would it handle the offsets already calculated

Graph testing question

2015-12-01 Thread Nathan Kronenfeld
I'm trying to test some graph operations I've written using GraphX. To make sure I catch all appropriate test cases, I'm trying to specify an input graph that is partitioned a specific way. Unfortunately, it seems graphx.Graph repartitions and shuffles any input node and edge RDD I give it. Is t

Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Yes. The use case would be: have Spark in a service (I didn't investigate this yet); through API calls to this service we perform some aggregations over data in SQL. We are already doing this with an internal development. Nothing complicated, for instance, a table with Product, Product Family, cost,

ClassLoader resources on executor

2015-12-01 Thread Charles Allen
Is there a way to pass configuration file resources to be resolvable through the classloader? For example, if I'm using a library (non-spark) that can use a some-lib.properties file in the classpath/classLoader, can I pass that file so that when it tries to get the resource from the classloader it

Re: Low Latency SQL query

2015-12-01 Thread Jörn Franke
can you elaborate more on the use case? > On 01 Dec 2015, at 20:51, Andrés Ivaldi wrote: > > Hi, > > I'd like to use spark to perform some transformations over data stored in SQL, > but I need low latency, I'm doing some test and I run into spark context > creation and data query over SQL tak

Re: spark-ec2 vs. EMR

2015-12-01 Thread Alexander Pivovarov
1. Emr 4.2.0 has Zeppelin as an alternative to DataBricks Notebooks 2. Emr has Ganglia 3.6.0 3. Emr has hadoop fs settings to make s3 work fast (direct.EmrFileSystem) 4. EMR has s3 keys in hadoop configs 5. EMR allows to resize cluster on fly. 6. EMR has aws sdk in spark classpath. Helps to re

Re: Low Latency SQL query

2015-12-01 Thread Josh Rosen
Use a long-lived SparkContext rather than creating a new one for each query. On Tue, Dec 1, 2015 at 11:52 AM Andrés Ivaldi wrote: > Hi, > > I'd like to use spark to perform some transformations over data stored > in SQL, but I need low latency, I'm doing some test and I run into spark > context c
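A minimal sketch of that idea, assuming a standalone service process; the object and method names are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // One context for the lifetime of the service; every request reuses it.
    object SparkService {
      lazy val sc = new SparkContext(new SparkConf().setAppName("low-latency-sql"))
      lazy val sqlContext = new SQLContext(sc)

      def runQuery(query: String) = sqlContext.sql(query).collect()
    }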

Re: Getting all files of a table

2015-12-01 Thread Krzysztof Zarzycki
Great that worked! The only problem was that it returned all the files including _SUCCESS and _metadata, but I filtered only the *.parquet Thanks Michael, Krzysztof 2015-12-01 20:20 GMT+01:00 Michael Armbrust : > sqlContext.table("...").inputFiles > > (this is best effort, but should work for h
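A short sketch of the combined approach, assuming a table name of "my_table":

    // List the files backing the table, keeping only the parquet parts
    // (drops _SUCCESS, _metadata, etc.).
    val parquetFiles = sqlContext.table("my_table").inputFiles.filter(_.endsWith(".parquet"))
    parquetFiles.foreach(println)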

Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Hi, I'd like to use Spark to perform some transformations over data stored in SQL, but I need low latency. I'm doing some tests and I find that Spark context creation and data queries over SQL take too long. Any idea for speeding up the process? regards. -- Ing. Ivaldi Andres

Re: spark streaming count msg in batch

2015-12-01 Thread Gerard Maas
dstream.count() See: http://spark.apache.org/docs/latest/programming-guide.html#actions -kr, Gerard. On Tue, Dec 1, 2015 at 6:32 PM, patcharee wrote: > Hi, > > In spark streaming how to count the total number of message (from Socket) > in one batch? > > Thanks, > Patcharee > >
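A tiny sketch, assuming a socket source and an existing StreamingContext `ssc`:

    // Print the number of records in each batch of the stream.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()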

Re: Getting all files of a table

2015-12-01 Thread Michael Armbrust
sqlContext.table("...").inputFiles (this is best effort, but should work for hive tables). Michael On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki wrote: > Hi there, > Do you know how easily I can get a list of all files of a Hive table? > > What I want to achieve is to get all files that

Getting all files of a table

2015-12-01 Thread Krzysztof Zarzycki
Hi there, Do you know how easily I can get a list of all files of a Hive table? What I want to achieve is to get all files that are underneath parquet table and using sparksql-protobuf[1] library(really handy library!) and its helper class ProtoParquetRDD: val protobufsRdd = new ProtoParquetRDD(s

Re: Grid search with Random Forest

2015-12-01 Thread Ndjido Ardo BAR
Thanks for the clarification. Gonna test that and give you feedbacks. Ndjido On Tue, 1 Dec 2015 at 19:29, Joseph Bradley wrote: > You can do grid search if you set the evaluator to a > MulticlassClassificationEvaluator, which expects a prediction column, not a > rawPrediction column. There's a

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
You can do grid search if you set the evaluator to a MulticlassClassificationEvaluator, which expects a prediction column, not a rawPrediction column. There's a JIRA for making BinaryClassificationEvaluator accept prediction instead of rawPrediction. Joseph On Tue, Dec 1, 2015 at 5:10 AM, Benjami
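A sketch of that setup; `trainingDF` with "label"/"features" columns and the grid values are assumptions:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
    val paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .addGrid(rf.maxDepth, Array(5, 10))
      .build()
    val cv = new CrossValidator()
      .setEstimator(new Pipeline().setStages(Array(rf)))
      .setEvaluator(new MulticlassClassificationEvaluator())  // scores the prediction column
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
    val cvModel = cv.fit(trainingDF)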

RE: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Boavida, Rodrigo
Hi Jacek, Yes I was told that as well, but no one gave me release schedules, and I have an immediate need to have Spark applications communicating with Akka clusters based on the latest version. I'm aware there is an ongoing effort to change to the low-level Netty implementation but AFAIK it's not

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Jacek Laskowski
On Tue, Dec 1, 2015 at 2:32 PM, RodrigoB wrote: > I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0. Why? AFAIK Spark's leaving Akka's boat and joins Netty's. Jacek - To unsubscribe, e-mail: user-unsubscr...@s

Re: Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Ted Yu
See SPARK-7252 If you issue query directly through hbase, does it work ? Cheers On Tue, Dec 1, 2015 at 8:43 AM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Spark - 1.3.1 > Hbase - 1.0.0 > Phoenix - 4.3 > Cloudera - 5.4 > > On Tue, Dec 1, 2015 at 9:35 PM, Ted Yu wrote: > >> What a

spark streaming count msg in batch

2015-12-01 Thread patcharee
Hi, In spark streaming how to count the total number of message (from Socket) in one batch? Thanks, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache

Re: spark-ec2 vs. EMR

2015-12-01 Thread Jerry Lam
Simply put: EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR API + Selected Instance Types + Amazon EC2 Friendly (bootstrapping) spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any Instance Type I use spark-ec2 for prototyping and I have never used it for p

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
From the dependency tree, akka 2.4.0 was in effect. Maybe check the classpath of the master to see if there is an older version of akka. Cheers

Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-01 Thread Darin McBeath
The problem isn't really with DTD validation (by default validation is disabled). The underlying problem is that the DTD can't be found (which is indicated in your stack trace below). The underlying parser will try and retrieve the DTD (regardless of validation) because things such as entit

Re: Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Akhilesh Pathodia
Spark - 1.3.1 Hbase - 1.0.0 Phoenix - 4.3 Cloudera - 5.4 On Tue, Dec 1, 2015 at 9:35 PM, Ted Yu wrote: > What are the versions for Spark / HBase / Phoenix you're using ? > > Cheers > > On Tue, Dec 1, 2015 at 4:15 AM, Akhilesh Pathodia < > pathodia.akhil...@gmail.com> wrote: > >> Hi, >> >> I am r

Migrate a cassandra table among from one cluster to another

2015-12-01 Thread George Sigletos
Hello, Does anybody know how to copy a cassandra table (or an entire keyspace) from one cluster to another using Spark? I haven't found anything very specific about this so far. Thank you, George

Re: spark-ec2 vs. EMR

2015-12-01 Thread Nick Chammas
Pinging this thread in case anyone has thoughts on the matter they want to share. On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Spark has come bundled with spark-ec2 > for many years. At > the same t

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
I actually haven't tried that, since I tend to do the offset lookups if necessary. It's possible that it will work, try it and let me know. Be aware that if you're doing a count() or take() operation directly on the rdd it'll definitely give you the wrong result if you're using -1 for one of the

Re: Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Ted Yu
What are the versions for Spark / HBase / Phoenix you're using ? Cheers On Tue, Dec 1, 2015 at 4:15 AM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Hi, > > I am running spark job on yarn in cluster mode in secured cluster. Spark > executors are unable to get hbase connection using

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Alan Braithwaite
Neat, thanks. If I specify something like -1 as the offset, will it consume from the latest offset or do I have to instrument that manually? - Alan On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger wrote: > Yes, there is a version of createDirectStream that lets you specify > fromOffsets: Map[Top

Re: Spark and simulated annealing

2015-12-01 Thread marfago
HI, Thank you for your suggestion. Is the scipy library (in particular scipy.optimize.anneal function) still able to leverage the parallelism and distributed calculus offered by Spark? Marco -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-simula

Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-01 Thread SRK
Hi, We need to monitor and identify if the Streaming job has been failing for the last 5 minutes and restart the job accordingly. In most cases our Spark Streaming with Kafka direct fails with leader lost errors. Or offsets not found errors for that partition. What is the most effective way to mo

RE: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Boavida, Rodrigo
Thanks, that worked! I'll let you know the results. Tnks, Rod From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: 01 December 2015 15:36 To: Boavida, Rodrigo Cc: user@spark.apache.org Subject: Re: Scala 2.11 and Akka 2.4.0 Please specify the following in your maven commands: -Dscala-2.11 Cheers This e

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
Please specify the following in your maven commands: -Dscala-2.11 Cheers

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
I don't see 2.4.0 release under: http://mvnrepository.com/artifact/com.typesafe.akka/akka-remote_2.10 Probably that was the cause for the 'Could not find artifact' error. On Tue, Dec 1, 2015 at 7:03 AM, Boavida, Rodrigo wrote: > Hi Ted, > > Thanks for getting back to me and for the suggestion.

RE: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Boavida, Rodrigo
Hi Ted, Thanks for getting back to me and for the suggestion. Running a 'mvn dependency:tree' I get the following: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.5.2: The following artifacts could not

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Cody Koeninger
If you're consistently getting offset out of range exceptions, it's probably because messages are getting deleted before you've processed them. The only real way to deal with this is give kafka more retention, consume faster, or both. If you're just looking for a quick "fix" for an infrequent iss

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
Yes, there is a version of createDirectStream that lets you specify fromOffsets: Map[TopicAndPartition, Long] On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite wrote: > Is there any mechanism in the kafka streaming source to specify the exact > partition id that we want a streaming job to consum
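A sketch of that overload; `ssc`, `kafkaParams`, the topic name and the offset values are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Start each chosen partition at an explicit offset.
    val fromOffsets = Map(
      TopicAndPartition("my_topic", 0) -> 1000L,
      TopicAndPartition("my_topic", 1) -> 2500L)
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))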

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
Have you run 'mvn dependency:tree' and examined the output ? There should be some hint. Cheers > On Dec 1, 2015, at 5:32 AM, RodrigoB wrote: > > Hi, > > I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0. > I've changed the main pom.xml files to corresponding akka version and

Send JsonDocument to Couchbase

2015-12-01 Thread Eyal Sharon
Hi , I am still having problems with Couchbase connector . Consider the following code fragment which aims to create a Json document to send to Couch *val muCache = { val values = JsonArray.from(mu.toArray.map(_.toString)) val content = JsonObject.create().put("feature_mean", values).put("

diff between apps and waitingApps?

2015-12-01 Thread Romi Kuntsman
Hello, I'm collecting metrics of master.apps and master.waitingApps, and I see the values always match. ref: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apac

Scala 2.11 and Akka 2.4.0

2015-12-01 Thread RodrigoB
Hi, I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0. I've changed the main pom.xml files to corresponding akka version and am getting the following exception when starting the master on standalone: Exception Details: Location: akka/dispatch/Mailbox.processAllSystemMessage

Turning off DTD Validation using XML Utils package - Spark

2015-12-01 Thread Shivalik
Hi Team, I've been using XML Utils library (http://spark-packages.org/package/elsevierlabs-os/spark-xml-utils) to parse XML using XPath in a spark job. One problem I am facing is with the DTDs. My XML file, has a doctype tag included in it. I want to turn off DTD validation using this library si

Re: Grid search with Random Forest

2015-12-01 Thread Benjamin Fradet
Someone correct me if I'm wrong but no there isn't one that I am aware of. Unless someone is willing to explain how to obtain the raw prediction column with the GBTClassifier. In this case I'd be happy to work on a PR. On 1 Dec 2015 8:43 a.m., "Ndjido Ardo BAR" wrote: > Hi Benjamin, > > Thanks,

Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread akhileshp
Hi, I am running spark job on yarn in cluster mode in secured cluster. Spark executors are unable to get hbase connection using phoenix. I am running knit command to get the ticket before starting the job and also keytab file and principal are correctly specified in connection URL. But still spark

Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Akhilesh Pathodia
Hi, I am running spark job on yarn in cluster mode in secured cluster. Spark executors are unable to get hbase connection using phoenix. I am running knit command to get the ticket before starting the job and also keytab file and principal are correctly specified in connection URL. But still spark

Re: spark rdd grouping

2015-12-01 Thread ayan guha
I believe reduceByKeyLocally was introduced for this purpose. On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski wrote: > Hi Rajat, > > My quick test has showed that groupBy will preserve the partitions: > > scala> > sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex > { case
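A tiny sketch with made-up data: reduceByKeyLocally merges values per key and returns the result to the driver as a Map rather than an RDD.

    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
      .reduceByKeyLocally(_ + _)
    // counts: scala.collection.Map[String, Int] = Map(a -> 2, b -> 1)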

Re: spark rdd grouping

2015-12-01 Thread Jacek Laskowski
Hi Rajat, My quick test has shown that groupBy will preserve the partitions: scala> sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex { case (idx, iter) => val s = iter.toSeq; println(idx + " with " + s.size + " elements: " + s); s.toIterator }.groupBy(_._1).mapPartitionsW

Re: Spark streaming job hangs

2015-12-01 Thread Archit Thakur
Which version of Spark are you running? Have you created a Kafka direct stream? I am asking because you might / might not be using receivers. Also, when you say hangs, do you mean there is no other log after this and the process is still up? Or do you mean it kept on adding the jobs but did nothing else? (I am op

New to Spark

2015-12-01 Thread Ashok Kumar
Hi, I am new to Spark. I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables. I have successfully made Hive metastore to be used by Spark. In spark-sql I can see the DDL for Hive tables. However, when I do select count(1) from HIVE_TABLE it always returns zero rows. If I creat

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Jacek Laskowski
On Tue, Dec 1, 2015 at 10:57 AM, Shams ul Haque wrote: > Thanks for the suggestion, i am going to try union. ...and please report your findings back. > And what is your opinion on 2nd question. Dunno. If you find a solution, let us know. Jacek

Re: Spark streaming job hangs

2015-12-01 Thread Paul Leclercq
You might not have enough cores to process data from Kafka > When running a Spark Streaming program locally, do not use “local” or > “local[1]” as the master URL. Either of these means that only one thread > will be used for running tasks locally. If you are using a input DStream > based on a rec
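A minimal sketch of the fix when running locally with a receiver-based input stream (app name and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Use at least two local threads: one for the receiver, one (or more) for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-app")
    val ssc = new StreamingContext(conf, Seconds(10))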

Re: Failing to execute Pregel shortest path on 22k nodes

2015-12-01 Thread Robineast
1. The for loop is executed in your driver program so will send each Pregel request serially to be executed on the cluster 2. Whilst caching/persisting may improve the runtime it shouldn't affect the memory bounds - if you ask to cache more than is available then cached RDDs will be dropped out of

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sushrut Ikhar
Hi, I have myself used union in a similar case and applied reduceByKey on it. Union + reduceByKey will suffice instead of a join, but you will have to first use map so that all values are of the same datatype. Regards, Sushrut Ikhar

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sonal Goyal
I think you should be able to join different rdds with same key. Have you tried that? On Dec 1, 2015 3:30 PM, "Praveen Chundi" wrote: > cogroup could be useful to you, since all three are PairRDD's. > > > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunct

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Praveen Chundi
cogroup could be useful to you, since all three are PairRDD's. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions Best Regards, Praveen On 01.12.2015 10:47, Shams ul Haque wrote: Hi All, I have made 3 RDDs of 3 different dataset, all RDDs are grou
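A small sketch with made-up data: cogroup keeps each source's values separate, grouped by the shared key.

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 99.0), (1, 15.5)))
    val clicks = sc.parallelize(Seq((2, "home"), (2, "cart")))
    // RDD[(Int, (Iterable[String], Iterable[Double], Iterable[String]))]
    val merged = users.cogroup(orders, clicks)
    merged.collect().foreach(println)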

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Shams ul Haque
Hi Jacek, Thanks for the suggestion, I am going to try union. And what is your opinion on the 2nd question? Thanks Shams On Tue, Dec 1, 2015 at 3:23 PM, Jacek Laskowski wrote: > Hi, > > Never done it before, but just yesterday I found out about > SparkContext.union method that could help in your
