Re: Package Release Announcement: Spark SQL on HBase Astro

2015-07-28 Thread Debasish Das
That's awesome Yan. I was considering Phoenix for SQL calls to HBase since Cassandra supports CQL but HBase QL support was lacking. I will get back to you as I start using it on our loads. I am assuming the latencies won't be much different from accessing HBase through tsdb asynchbase as that's

A question about spark checkpoint

2015-07-28 Thread bit1...@163.com
Hi, I have the following code that uses checkpoint to checkpoint the heavy ops, which works well in that the last heavyOpRDD.foreach(println) does not recompute from the beginning. But when I re-run this program, the RDD computing chain is recomputed from the beginning, I thought that it will
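
A minimal sketch of the RDD checkpointing pattern being discussed (paths and the transformation are placeholders, not the poster's code); note that plain RDD checkpoint files are not automatically picked up by a fresh run of the program, which is why a re-run recomputes the chain:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")          // assumed HDFS path

    // stand-in for the heavy operation
    val heavyOpRDD = sc.textFile("hdfs:///data/input").map(_.toUpperCase)
    heavyOpRDD.checkpoint()      // marks the RDD; written out by the next action
    heavyOpRDD.count()           // first action materializes the checkpoint
    heavyOpRDD.foreach(println)  // later actions in this run reuse the checkpointed data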

pyspark/py4j tree error

2015-07-28 Thread Dirk Nachbar
I am using pyspark and I want to test the sql function. I get this Java tree error. Any ideas. iwaggDF.registerTempTable('iwagg') hierDF.registerTempTable('hier') res3=sqlc.sql('select name, sum(amount) as amount from iwagg a left join hier b on a.segm=b.segm group by name order by sum(amount)

java.io.IOException: failure to login

2015-07-28 Thread glen
Hi, I’ve posted this question to stackoverflow.com here http://stackoverflow.com/questions/31534458/failing-integration-test-for-apache-spark-streaming but it’s not getting any responses. I've been trying to track down an issue with some unit/integration tests I've been writing for a Spark

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
That happens when your batch duration is less than your processing time; you need to set StorageLevel to MEMORY_AND_DISK. If you are using the latest version of Spark and you are just exploring things, then you can go with the Kafka consumers that come with Spark itself. You will not have this

RE: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Manohar Reddy
Yaa, got it. Thanks Akhil. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, July 28, 2015 2:47 PM To: Manohar Reddy Cc: user@spark.apache.org Subject: Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client That happens when your batch duration is less than your processing

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
Hi Ayan, Thanks for the reply. It's around 5 GB across 10 tables... this data changes very frequently, with a few updates every minute. It's difficult to keep this data in Spark: if any updates happen on the main tables, how can I refresh the Spark data? On 28 July 2015 at 02:11, ayan guha guha.a...@gmail.com wrote:

Re: use S3-Compatible Storage with spark

2015-07-28 Thread Schmirr Wurst
Hi, recompiled and retried, now it's looking like this with s3a: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain. S3n is working fine (only problem is still the endpoint) - To

Spark SQL ArrayOutofBoundsException Question

2015-07-28 Thread tranan
Hello all, I am currently having an error with Spark SQL accessing Elasticsearch using the Elasticsearch Spark integration. Below is the series of commands I issued along with the stacktrace. I am unclear what the error could mean. I can print the schema correctly but error out if I try and display a

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
Put a try catch inside your code and inside the catch print out the length or the list itself which causes the ArrayIndexOutOfBounds. It might happen that some of your data is not proper. Thanks Best Regards On Mon, Jul 27, 2015 at 8:24 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi

RE: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Manohar Reddy
Hi Akhil, Thanks for the reply. I found the root cause but don't know how to solve it. Below is the cause: this map function is not being executed, and because of this all my list fields are empty. Please let me know what might cause this snippet of code not to execute. The below map is

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
I'm trying to do that, but there will always be a data mismatch, since by the time Sqoop is fetching, the main database will get many updates. There is something called incremental data fetch in Sqoop, but that hits the database rather than reading the WAL edits. On 28 July 2015 at 02:52,

Re: ReceiverStream SPARK not able to cope up with 20,000 events /sec .

2015-07-28 Thread Akhil Das
You need to find the bottleneck here; it could be your network (if the data is huge) or your producer code isn't pushing at 20k/s. If you are able to produce at 20k/s, then make sure you are able to receive at that rate (try it without Spark). Thanks Best Regards On Sat, Jul 25, 2015 at 3:29 PM,

Spark-Cassandra connector DataFrame

2015-07-28 Thread simon wang
Hi, I would like to get recommendations on using the Spark-Cassandra connector DataFrame feature. I was trying to save a DataFrame containing 8 million rows to Cassandra through the Spark-Cassandra connector. Based on the Spark log, this single action took about 60 minutes to complete. I think it

Re: Spark on Mesos - Shut down failed while running spark-shell

2015-07-28 Thread Tim Chen
Hi Haripriya, Your master has registered its public IP as 127.0.0.1:5050, which won't be reachable by the slave node. If Mesos didn't pick up the right IP you can specify one yourself via the --ip flag. Tim On Mon, Jul 27, 2015 at 8:32 PM, Haripriya Ayyalasomayajula

Spark Number of Partitions Recommendations

2015-07-28 Thread Rahul Palamuttam
Hi All, I was wondering why the recommended number for parallelism is 2-3 times the number of cores in your cluster. Is the heuristic explained in any of the Spark papers? Or is it more of an agreed-upon rule of thumb? Thanks, Rahul P -- View this message in context:

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Proust GZ Feng
Thanks Owen, the problem under Cygwin is that when running spark-submit under 1.4.0, it simply reports: Error: Could not find or load main class org.apache.spark.launcher.Main. This is because under Cygwin spark-class makes the LAUNCH_CLASSPATH as

RE: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Manohar Reddy
Thanks Akhil, that is solved now, but below is the new stack trace. Don't feel bad, I am looking into it, but if you have it at your fingertips, please: 15/07/28 09:03:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 77, ip-10-252-7-70.us-west-2.compute.internal): java.lang.Exception: Could

Re: use S3-Compatible Storage with spark

2015-07-28 Thread Akhil Das
With s3n try this out: s3service.s3-endpoint: the host name of the S3 service. You should only ever change this value from the default if you need to contact an alternative S3 endpoint for testing purposes. Default: s3.amazonaws.com Thanks Best Regards On Tue, Jul 28, 2015 at 1:54 PM, Schmirr

Re: Question abt serialization

2015-07-28 Thread Akhil Das
Did you try it with just (comment out line 27): println "Count of spark: " + file.filter({ s -> s.contains('spark') }).count() Thanks Best Regards On Sun, Jul 26, 2015 at 12:43 AM, tog guillaume.all...@gmail.com wrote: Hi I have been using Spark for quite some time using either scala or python.

Re: unsubscribe

2015-07-28 Thread Brandon White
NO! On Tue, Jul 28, 2015 at 5:03 PM, Harshvardhan Chauhan ha...@gumgum.com wrote: -- *Harshvardhan Chauhan* | Software Engineer *GumGum* http://www.gumgum.com/ | *Ads that stick* 310-260-9666 | ha...@gumgum.com

unsubscribe

2015-07-28 Thread Harshvardhan Chauhan
-- *Harshvardhan Chauhan* | Software Engineer *GumGum* http://www.gumgum.com/ | *Ads that stick* 310-260-9666 | ha...@gumgum.com

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Marcelo Vanzin
Can you run the windows batch files (e.g. spark-submit.cmd) from the cygwin shell? On Tue, Jul 28, 2015 at 7:26 PM, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Owen Add back the cygwin classpath detection can pass the issue mentioned before, but there seems lack of further support in the

Re: restart from last successful stage

2015-07-28 Thread ayan guha
Hi, I do not think the OP asks about attempt failure, but about stage failure finally leading to job failure. In that case, RDD info from the last run is gone even if it was cached, isn't it? Ayan On 29 Jul 2015 07:01, Tathagata Das t...@databricks.com wrote: If you are using the same RDDs in both the

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Ashwin Giridharan
@Das, Is there any way to identify a Kafka topic when we have a unified stream? As of now, for each topic I create a dedicated DStream and use foreachRDD on each of these streams. If I have, say, 100 Kafka topics, then how can I use a unified stream and still take topic-specific actions inside foreachRDD?

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Brandon White
Thank you Tathagata. My main use case for the 500 streams is to append new elements into their corresponding Spark SQL tables. Every stream is mapped to a table, so I'd like to use the streams to append the new RDDs to the table. If I union all the streams, appending new elements becomes a

Re: broadcast variable question

2015-07-28 Thread Jonathan Coveney
That's great! Thanks El martes, 28 de julio de 2015, Ted Yu yuzhih...@gmail.com escribió: If I understand correctly, there would be one value in the executor. Cheers On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney jcove...@gmail.com javascript:_e(%7B%7D,'cvml','jcove...@gmail.com');

Job hang when running random forest

2015-07-28 Thread Andy Zhao
Hi guys, A job hung for about 16 hours when I ran the random forest algorithm, and I don't know why that happened. I use Spark 1.4.0 on YARN and here is the code http://apache-spark-user-list.1001560.n3.nabble.com/file/n24047/1.png and the following picture is from the Spark UI

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Proust GZ Feng
Hi, Owen Adding back the Cygwin classpath detection gets past the issue mentioned before, but there seems to be a lack of further support in the launch lib; see the stacktrace below. LAUNCH_CLASSPATH: C:\spark-1.4.0-bin-hadoop2.3\lib\spark-assembly-1.4.0-hadoop2.3.0.jar java -cp

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Proust GZ Feng
Thanks Vanzin, spark-submit.cmd works Thanks Proust From: Marcelo Vanzin van...@cloudera.com To: Proust GZ Feng/China/IBM@IBMCN Cc: Sean Owen so...@cloudera.com, user user@spark.apache.org Date: 07/29/2015 10:35 AM Subject:Re: NO Cygwin Support in bin/spark-class in Spark

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
@Ashwin: You could append the topic in the data. val kafkaStreams = topics.map { topic => KafkaUtils.createDirectStream(topic...).map { x => (x, topic) } } val unionedStream = context.union(kafkaStreams) @Brandon: I don't recommend it, but you could do something crazy like use the
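
Spelled out a little more (ssc, kafkaParams and the topic names below are assumptions, not taken from the original message), the tagging approach looks roughly like this:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val topics = Seq("topicA", "topicB", "topicC")
    val kafkaStreams = topics.map { topic =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic)
      ).map { case (_, value) => (topic, value) }       // tag each record with its topic
    }
    val unionedStream = ssc.union(kafkaStreams)

    unionedStream.foreachRDD { rdd =>
      // topic-specific handling on the unified stream
      rdd.filter(_._1 == "topicA").foreach(r => println(r._2))
    }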

Job hang when running random forest

2015-07-28 Thread Andy Zhao
Hi guys, When I run the random forest algorithm, a job hangs for 15.8 h and I cannot figure out why that happened. Here is the code http://apache-spark-user-list.1001560.n3.nabble.com/file/n24046/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7_2015-07-29_%E4%B8%8A%E5%8D%8810.png And I use Spark 1.4.0 on YARN

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
I don't think anyone has really run 500 text streams. And the parSequences do nothing there; you are only parallelizing the setup code, which does not really compute anything. It also sets up 500 foreachRDD operations that will get executed in each batch sequentially, so it does not make sense. The

Re: Spark Streaming Json file groupby function

2015-07-28 Thread swetha
Hi TD, We have a requirement to maintain the user session state and to maintain/update the metrics for minute, day and hour granularities for a user session in our Streaming job. Can I keep those granularities in the state and recalculate each time there is a change? How would the

broadcast variable question

2015-07-28 Thread Jonathan Coveney
I am running in coarse-grained mode, let's say with 8 cores per executor. If I use a broadcast variable, will all of the tasks in that executor share the same value? Or will each task broadcast its own value, i.e. in this case, would there be one value in the executor shared by the 8 tasks, or would

Re: broadcast variable question

2015-07-28 Thread Ted Yu
If I understand correctly, there would be one value in the executor. Cheers On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney jcove...@gmail.com wrote: i am running in coarse grained mode, let's say with 8 cores per executor. If I use a broadcast variable, will all of the tasks in that
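
A small sketch of that behaviour (the data is made up): the broadcast value is shipped and deserialized once per executor, and all tasks running in that executor read the same copy.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))    // one copy per executor JVM

    val total = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(k => lookup.value.getOrElse(k, 0))             // every task reads the shared copy
      .sum()
    println(total)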

Re: Spark Streaming Json file groupby function

2015-07-28 Thread Tathagata Das
If you are trying to keep such long term state, it will be more robust in the long term to use a dedicated data store (cassandra/HBase/etc.) that is designed for long term storage. On Tue, Jul 28, 2015 at 4:37 PM, swetha swethakasire...@gmail.com wrote: Hi TD, We have a requirement to

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Ted Yu
Can you show the full stack trace ? Which Spark release are you using ? Thanks On Jul 27, 2015, at 10:07 AM, Wayne Song wayne.e.s...@gmail.com wrote: Hello, I am trying to start a Spark master for a standalone cluster on an EC2 node. The CLI command I'm using looks like this:

Re: Multiple operations on same DStream in Spark Streaming

2015-07-28 Thread Dean Wampler
Is this average supposed to be across all partitions? If so, it will require one of the reduce operations in every batch interval. If that's too slow for the data rate, I would investigate using PairDStreamFunctions.updateStateByKey to compute the sum + count of the 2nd integers, per 1st
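
A hedged sketch of that updateStateByKey approach, assuming a pairs: DStream[(Int, Int)] input and an existing ssc (neither is from the original thread):

    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // stateful ops need a checkpoint dir

    // keep a running (sum, count) per key, then derive the average from it
    val sumAndCount = pairs.updateStateByKey[(Long, Long)] {
      (newValues: Seq[Int], state: Option[(Long, Long)]) =>
        val (s, c) = state.getOrElse((0L, 0L))
        Some((s + newValues.map(_.toLong).sum, c + newValues.size))
    }
    val averages = sumAndCount.mapValues { case (s, c) => if (c == 0) 0.0 else s.toDouble / c }
    averages.print()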

Re: Is spark suitable for real time query

2015-07-28 Thread Petar Zecevic
You can try out a few tricks employed by folks at Lynx Analytics... Daniel Darabos gave some details at Spark Summit: https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs On 22.7.2015. 17:00, Louis Hust wrote: My code like below: MapString,

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Petar Zecevic
Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in Spark in Action book: http://www.manning.com/bonaci/ On 27.7.2015. 10:22, Akhil Das wrote: You can follow this doc

Re: Multiple operations on same DStream in Spark Streaming

2015-07-28 Thread Akhil Das
One approach would be to store the batch data in an intermediate storage (like HBase/MySQL or even in zookeeper), and inside your filter function you just go and read the previous value from this storage and do whatever operation that you are supposed to do. Thanks Best Regards On Sun, Jul 26,

Checkpoints in SparkStreaming

2015-07-28 Thread Guillermo Ortiz
I'm using Spark Streaming and I want to configure checkpointing to manage fault tolerance. I've been reading the documentation. Is it necessary to create and configure the InputDStream in the getOrCreate function? I checked the example in

Re: Question abt serialization

2015-07-28 Thread tog
Hi Akhil, I have it working now with the Groovy REPL in a form similar to the one you are mentioning. Still, I don't understand why the previous form (with the Function) is raising that exception. Cheers Guillaume On 28 July 2015 at 08:56, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try it

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Akhil Das
Did you try binding to 0.0.0.0? Thanks Best Regards On Mon, Jul 27, 2015 at 10:37 PM, Wayne Song wayne.e.s...@gmail.com wrote: Hello, I am trying to start a Spark master for a standalone cluster on an EC2 node. The CLI command I'm using looks like this: Note that I'm specifying the

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
You need to trigger an action on your rowrdd for it to execute the map; you can do a rowrdd.count() for that. Thanks Best Regards On Tue, Jul 28, 2015 at 2:18 PM, Manohar Reddy manohar.re...@happiestminds.com wrote: Hi Akhil, Thanks for the reply. I found the root cause but don't know how

Re: use S3-Compatible Storage with spark

2015-07-28 Thread Schmirr Wurst
I tried those 3 possibilities, and everything is still working => the endpoint param is not taking effect: sc.hadoopConfiguration.set("s3service.s3-endpoint", "test") sc.hadoopConfiguration.set("fs.s3n.endpoint", "test") sc.hadoopConfiguration.set("fs.s3n.s3-endpoint", "test") 2015-07-28 10:28 GMT+02:00 Akhil Das

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Cody Koeninger
You don't have to use some other package in order to get access to the offsets. Shushant, have you read the available documentation at http://spark.apache.org/docs/latest/streaming-kafka-integration.html https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md or watched
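
For reference, the direct-stream offset pattern from that documentation looks roughly like this (the stream setup and kafkaParams are assumed, not copied from the thread):

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val directKafkaStream = KafkaUtils.createDirectStream[
      Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, Set("mytopic"))

    directKafkaStream.foreachRDD { rdd =>
      // one OffsetRange per Kafka partition in this batch
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }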

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Ted Yu
Can you show us the snippet of the exception stack ? Thanks On Jul 27, 2015, at 10:22 PM, Stephen Boesch java...@gmail.com wrote: when using spark-submit: which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries

Re: log file directory

2015-07-28 Thread Ted Yu
The path to the log file should be displayed when you launch the master, e.g. /mnt/var/log/apps/spark-hadoop-org.apache.spark.deploy.master.Master-MACHINENAME.out On Mon, Jul 27, 2015 at 11:28 PM, Jack Yang j...@uow.edu.au wrote: Hi all, I have questions with regard to the log file directory.

Re: Checkpoints in SparkStreaming

2015-07-28 Thread Cody Koeninger
Yes, you need to follow the documentation. Configure your stream, including the transformations made to it, inside the getOrCreate function. On Tue, Jul 28, 2015 at 3:14 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I'm using SparkStreaming and I want to configure checkpoint to manage
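
A minimal sketch of that getOrCreate pattern (batch interval, source and checkpoint path are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("checkpointed-app"), Seconds(10))
      ssc.checkpoint(checkpointDir)
      // the stream AND its transformations are defined inside the factory function
      ssc.socketTextStream("localhost", 9999)
        .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        .print()
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()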

Re: Clustetr setup for SPARK standalone application:

2015-07-28 Thread Dean Wampler
When you say you installed Spark, did you install the master and slave services for standalone mode as described here http://spark.apache.org/docs/latest/spark-standalone.html? If you intended to run Spark on Hadoop, see here http://spark.apache.org/docs/latest/running-on-yarn.html. It looks like

Checkpoint issue in spark streaming

2015-07-28 Thread Sadaf
Hi all. I am writing a Twitter connector using Spark Streaming. I have written the following code to maintain a checkpoint: val ssc = StreamingContext.getOrCreate("hdfs://192.168.23.109:9000/home/cloud9/twitterCheckpoint", () => { managingContext() }) def managingContext(): StreamingContext = {

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-28 Thread Cody Koeninger
That stacktrace looks like an out of heap space on the driver while writing checkpoint, not on the worker nodes. How much memory are you giving the driver? How big are your stored checkpoints? On Tue, Jul 28, 2015 at 9:30 AM, Nicolas Phung nicolas.ph...@gmail.com wrote: Hi, After using

Fwd: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Priya Ch
Hi TD, Thanks for the info. I have the scenario like this. I am reading the data from kafka topic. Let's say kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each such that they would receive 3 dstreams (from 3 partitions of kafka

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Petar Zecevic
Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in Spark in Action book: http://www.manning.com/bonaci/ On 24.7.2015. 7:26, Siva Reddy wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I create a maven projects

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi Ted, yes, the Cloudera blog and your code were my starting point - but I needed something more Spark-centric rather than HBase-centric. Basically doing a lot of ad-hoc transformations with RDDs that were based on HBase tables and then mutating them after a series of iterative (BSP-like) steps. On 28 July

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Oops, yes, I'm still messing with the repo on a daily basis.. fixed On 28 July 2015 at 17:11, Ted Yu yuzhih...@gmail.com wrote: I got a compilation error: [INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling [INFO] Compiling 18 source files to

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Cool, will revisit, is your latest code visible publicly somewhere ? On 28 July 2015 at 17:14, Ted Malaska ted.mala...@cloudera.com wrote: Yup you should be able to do that with the APIs that are going into HBase. Let me know if you need to chat about the problem and how to implement it with

projection optimization?

2015-07-28 Thread Eric Friedman
If I have a Hive table with six columns and create a DataFrame (Spark 1.4.1) using a sqlContext.sql("select * from ...") query, the resulting physical plan shown by explain reflects the goal of returning all six columns. If I then call select("one_column") on that first DataFrame, the resulting
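
One way to see what the optimizer does here is to compare the two plans directly; a sketch with placeholder table and column names:

    val allCols = sqlContext.sql("select * from my_hive_table")
    allCols.select("one_column").explain(true)   // is the scan pruned to one column?

    val oneCol = sqlContext.sql("select one_column from my_hive_table")
    oneCol.explain(true)                         // plan when only one column is requested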

PySpark MLlib Numpy Dependency

2015-07-28 Thread Eskilson,Aleksander
The documentation for the Numpy dependency for MLlib seems somewhat vague [1]. Is Numpy only a dependency for the driver node, or must it also be installed on every worker node? Thanks, Alek [1] -- http://spark.apache.org/docs/latest/mllib-guide.html#dependencies

Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi all, for the last couple of months I've been working on large graph analytics, and along the way have written from scratch an HBase-Spark integration, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into

sc.parallelise to work more like a producer/consumer?

2015-07-28 Thread Kostas Kougios
Hi, I am using sc.parallelise(...32k of items) several times for 1 job. Each executor takes x amount of time to process its items, but this results in some executors finishing quickly and staying idle till the others catch up. Only after all executors complete the first 32k batch, the next batch

spark-csv number of partitions

2015-07-28 Thread Srikanth
Hello, I'm using spark-csv instead of sc.textFile() to work with CSV files. How can I set the number of partitions that will be created when reading a CSV? Basically, an equivalent of minPartitions in textFile(): val myrdd = sc.textFile("my.csv", 24) Srikanth
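
A hedged sketch: if the spark-csv reader does not expose a minPartitions-style option, the resulting DataFrame can be repartitioned right after loading (option names follow the com.databricks:spark-csv package; the path is a placeholder):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("my.csv")
      .repartition(24)      // roughly the same intent as sc.textFile("my.csv", 24)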

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Yu
I got a compilation error: [INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling [INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes at 1438099569598 [ERROR] /home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36:

Messages are not stored for actorStream when using RoundRobinRouter

2015-07-28 Thread Juan Rodríguez Hortalá
Hi, I'm using a simple Akka actor to create an actorStream. The actor just forwards the messages received to the stream by calling super[ActorHelper].store(msg). This works OK when I create the stream with ssc.actorStream[A](Props(new ProxyReceiverActor[A]), receiverActorName), but when I try to

Iterating over values by Key

2015-07-28 Thread gulyasm
I have K/V pairs where V is an Iterable (from previous groupBy). I use the JAVA API. What I want is to iterate over the values by key, and on every element set previousElementId attribute, that is the id of the previous element in the sorted list. I try to do this with mapValues. I create an

spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Shushant Arora
Hi, I am processing Kafka messages using Spark Streaming 1.3. I am using the mapPartitions function to process Kafka messages. How can I access the offset number of the individual message being processed? JavaPairInputDStream<byte[], byte[]> directKafkaStream = KafkaUtils.createDirectStream(..);

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
Does adding back the cygwin detection and this clause make it work? if $cygwin; then CLASSPATH=`cygpath -wp $CLASSPATH`; fi If so I imagine that's fine to bring back, if that's still needed. On Tue, Jul 28, 2015 at 9:49 AM, Proust GZ Feng pf...@cn.ibm.com wrote: Thanks Owen, the problem under

Re: *Metrics API is odd in MLLib

2015-07-28 Thread Sam
Hi Xiangrui Spark People, I recently got round to writing an evaluation framework for Spark that I was hoping to PR into MLLib and this would solve some of the aforementioned issues. I have put the code on github in a separate repo for now as I would like to get some sandboxed feedback. The

Authentication Support with spark-submit cluster mode

2015-07-28 Thread Anh Hong
Hi, I'd like to remotely run spark-submit from a local machine to submit a job to a Spark cluster (cluster mode). What method do I use to authenticate myself to the cluster? For example, how do I pass a user id, password, or private key to the cluster? Any help is appreciated.

SparkR does not include SparkContext

2015-07-28 Thread Siegfried Bilstein
Hi, I'm starting R on Spark via the sparkR script but I can't access the sparkcontext as described in the programming guide. Any ideas? Thanks, Siegfried

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Jörn Franke
You may check out apache phoenix on top of Hbase for this. However, it does not have ODBC drivers, but JDBC ones. Maybe Hive 1.2 with a new version of TEZ will also serve your purpose. You should run some proof of concept with these technologies using real or generated data. About how much data

Does spark-submit support file transfering from local to cluster?

2015-07-28 Thread Anh Hong
Hi, I'm using spark-submit cluster mode to submit a job from a local machine to a Spark cluster. There are input files, output files, and job log files that I need to transfer in and out between the local machine and the Spark cluster. Any recommended methods for transferring files? Is there any future

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Ruslan Dautkhanov
We want these user actions to respond within 2 to 5 seconds. I think this goal is a stretch for Spark. Some queries may run faster than that on a large dataset, but in general you can't put an SLA like this on it. For example, if you have to join some huge datasets, you'll likely be well over that.

Re: Data from PostgreSQL to Spark

2015-07-28 Thread santoshv98
Sqoop's incremental data fetch will reduce the data size you need to pull from the source, but by the time that incremental fetch is complete, won't the data be out of date again if its velocity is high? Maybe you can put a trigger in Postgres to send data to the big data cluster as

Re: hive.contrib.serde2.RegexSerDe not found

2015-07-28 Thread Gianluca Privitera
Try using: org.apache.hadoop.hive.serde2.RegexSerDe GP On 27 Jul 2015, at 09:35, ZhuGe t...@outlook.com wrote: Hi all: I am testing the performance of Hive on Spark SQL. The existing table is created with ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Burak Yavuz
Hey Stephen, In case these libraries exist on the client as a Maven library, you can use --packages to ship the library and all its dependencies, without building an uber jar. Best, Burak On Tue, Jul 28, 2015 at 10:23 AM, Marcelo Vanzin van...@cloudera.com wrote: Hi Stephen, There

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
can the source write to Kafka/Flume/HBase in addition to Postgres? No, it can't; this is due to the fact that there are many applications producing this PostgreSQL data, and I can't really ask all the teams to start writing to some other source. The velocity of the application is too

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Marcelo Vanzin
Hi Stephen, There is no such directory currently. If you want to add an existing jar to every app's classpath, you need to modify two config values: spark.driver.extraClassPath and spark.executor.extraClassPath. On Mon, Jul 27, 2015 at 10:22 PM, Stephen Boesch java...@gmail.com wrote: when
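
For illustration, the two settings can be passed with --conf on spark-submit or set programmatically; a sketch with a placeholder jar path:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraClassPath", "/opt/libs/thirdparty.jar")
      .set("spark.executor.extraClassPath", "/opt/libs/thirdparty.jar")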

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jörn Franke
Can you put some transparent cache in front of the database? Or some JDBC proxy? On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele gangele...@gmail.com wrote: can the source write to Kafka/Flume/HBase in addition to Postgres? no, it can't write; this is due to the fact that there are many

Actor not found for: ActorSelection

2015-07-28 Thread Haseeb
I just cloned the master repository of Spark from GitHub. I am running it on OS X 10.9, Spark 1.4.1 and Scala 2.10.4. I just tried to run the SparkPi example program using IntelliJ IDEA but get the error: akka.actor.ActorNotFound: Actor not found for:

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
Thanks Corey for your answer. Do you mean that final status: SUCCEEDED in the terminal logs means that the YARN RM could clean the resources after the application has finished (application finishing does not necessarily mean succeeded or failed)? With that logic it totally makes sense. Basically the

DataFrame DAG recomputed even though DataFrame is cached?

2015-07-28 Thread Kristina Rogale Plazonic
Hi, I'm puzzling over the following problem: when I cache a small sample of a big dataframe, the small dataframe is recomputed when selecting a column (but not if show() or count() is invoked). Why is that so and how can I avoid recomputation of the small sample dataframe? More details: - I

Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Hi all, I am experimenting with and learning about performance on big tasks locally, with a 32-core node and more than 64 GB of RAM; data is loaded from a database through a JDBC driver, and heavy computations are launched against it. I have two questions: 1. My RDD is poorly distributed. I

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
This might be an issue with how pyspark propagates the error back to the AM. I'm pretty sure this does not happen for Scala / Java apps. Have you filed a bug? On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov elkhan8...@gmail.com wrote: Thanks Corey for your answer, Do you mean that final

Re: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread shenyan zhen
Hi Saif, Are you using JdbcRDD directly from Spark? If yes, then the poor distribution could be due to the bound key you used. See the JdbcRDD Scala doc at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD : sql - the text of the query. The query must contain

RE: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Thank you for your response Zhen. I am using some vendor-specific JDBC driver JAR file (honestly, I don't know where it came from). Its API is NOT like JdbcRDD; instead, it's more like jdbc from DataFrameReader

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
BTW this is most probably caused by this line in PythonRunner.scala: System.exit(process.waitFor()) The YARN backend doesn't like applications calling System.exit(). On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin van...@cloudera.com wrote: This might be an issue with how pyspark

Re: DataFrame DAG recomputed even though DataFrame is cached?

2015-07-28 Thread Michael Armbrust
We will try to address this before Spark 1.5 is released: https://issues.apache.org/jira/browse/SPARK-9141 On Tue, Jul 28, 2015 at 11:50 AM, Kristina Rogale Plazonic kpl...@gmail.com wrote: Hi, I'm puzzling over the following problem: when I cache a small sample of a big dataframe, the

restart from last successful stage

2015-07-28 Thread Alex Nastetsky
Is it possible to restart the job from the last successful stage instead of from the beginning? For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long time and is successful, but the job fails on stage 1, it would be useful to be able to restart from the output of stage 0

Re: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread shenyan zhen
Saif, I am guessing, as I am not sure of your use case. Are you retrieving the entire table into Spark? If yes, do you have a primary key on your table? If also yes, then JdbcRDD should be efficient. DataFrameReader.jdbc gives you more options, again depending on your use case. Is it possible for you to describe
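
A sketch of a partitioned read via DataFrameReader.jdbc, assuming a numeric primary key column and made-up connection details:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "secret")

    // Spark splits the query into numPartitions ranges over the key column
    val df = sqlContext.read.jdbc(
      url = "jdbc:postgresql://dbhost:5432/mydb",
      table = "big_table",
      columnName = "id",
      lowerBound = 0L,
      upperBound = 10000000L,
      numPartitions = 32,
      connectionProperties = props)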

Re: Actor not found for: ActorSelection

2015-07-28 Thread Haseeb
The problem was that I was trying to start the example app in standalone cluster mode by passing in -Dspark.master=spark://myhost:7077 as an argument to the JVM. I launched the example app locally using -Dspark.master=local and it worked. -- View this message in context:

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
But then how can we find out programmatically (in Java) whether the job is making progress? Or whether the job has failed or succeeded? Is looking at the application log files the only way of knowing the job's final status (failed/succeeded)? Because when a job fails the Job History server does not have much info about

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
Thanks a lot for feedback, Marcelo. I've filed a bug just now - SPARK-9416 https://issues.apache.org/jira/browse/SPARK-9416 On Tue, Jul 28, 2015 at 12:14 PM, Marcelo Vanzin van...@cloudera.com wrote: BTW this is most probably caused by this line in PythonRunner.scala:

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Steve Loughran
there's a spark-submit.cmd file for Windows. Does that work? On 27 Jul 2015, at 21:19, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Users Looks like Spark 1.4.0 cannot work with Cygwin due to the removal of Cygwin support in bin/spark-class The changeset is

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-28 Thread Xiangrui Meng
Hi Stahlman, finalRDDStorageLevel is the storage level for the final user/item factors. It is not common to set it to StorageLevel.NONE, unless you want to save the factors directly to disk. So if it is NONE, we cannot unpersist the intermediate RDDs (in/out blocks) because the final user/item
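
For context, the setting is exposed on the MLlib ALS builder; a sketch with made-up parameters and an assumed ratings RDD:

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.storage.StorageLevel

    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)  // keep the final factors cached
    val model = als.run(ratings)   // ratings: RDD[Rating], assumed to exist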

Re: Proper saving/loading of MatrixFactorizationModel

2015-07-28 Thread Xiangrui Meng
The partitioner is not saved with the RDD. So when you load the model back, we lose the partitioner information. You can call repartition on the user/product factors and then create a new MatrixFactorizationModel object using the repartitioned RDDs. It would be useful to create a utility method
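
A hedged sketch of that workaround, assuming a previously loaded model and a made-up partition count:

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    val users    = loadedModel.userFeatures.repartition(64).cache()
    val products = loadedModel.productFeatures.repartition(64).cache()
    users.count(); products.count()    // materialize before serving queries

    val repartitionedModel = new MatrixFactorizationModel(loadedModel.rank, users, products)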

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
It wasn't removed, but rewritten. Cygwin is just a distribution of POSIX-related utilities so you should be able to use the normal .sh scripts. In any event, you didn't say what the problem is? On Tue, Jul 28, 2015 at 5:19 AM, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Users Looks like
