Re: Executor lost failure

2015-09-01 Thread Andrew Duffy
If you're using YARN with Spark 1.3.1, you could be running into https://issues.apache.org/jira/browse/SPARK-8119, although without more information it's impossible to know. On Tue, Sep 1, 2015 at 11:28 AM, Priya Ch wrote: > Hi All, > > I have a spark streaming

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
I feel the need for pause and resume in a streaming app :) Is there any limit on max queued jobs? If yes, what happens once that limit is reached? Does the job get killed? On Tue, Sep 1, 2015 at 10:02 PM, Cody Koeninger wrote: > Sounds like you'd be better off just failing if the

Re: How to avoid shuffle errors for a large join ?

2015-09-01 Thread Thomas Dudziak
While it works with sort-merge-join, it takes about 12h to finish (with 1 shuffle partitions). My hunch is that the reason for that is this: INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB to disk (62 times so far) (and lots more where this comes from). On Sat, Aug 29,

Executor lost failure

2015-09-01 Thread Priya Ch
Hi All, I have a spark streaming application which writes the processed results to cassandra. In local mode, the code seems to work fine. The moment I start running in distributed mode using YARN, I see executor lost failures. I increased executor memory to occupy the entire node's memory, which is

RE: What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Thank you, so I was confused the other way around. At first I thought MLLIB was the future, but based on what you say, MLLIB will be the past. Interesting. This means that if I move forward using the pipelines system, I shouldn't become obsolete. Any more insights welcome, Saif -Original Message-

Re: Reading xml in java using spark

2015-09-01 Thread Darin McBeath
Another option might be to leverage spark-xml-utils (https://github.com/dmcbeath/spark-xml-utils) This is a collection of xml utilities that I've recently revamped that make it relatively easy to use xpath, xslt, or xquery within the context of a Spark application (or at least I think so). My

Re: What is the current status of ML ?

2015-09-01 Thread Sean Owen
I think the story is that the new spark.ml "pipelines" API is the future. Most (all?) of the spark.mllib functionality has been ported over and/or translated. I don't know that spark.mllib will actually be deprecated soon -- not until spark.ml is fully blessed as 'stable' I'd imagine, at least.

Resource allocation in SPARK streaming

2015-09-01 Thread anshu shukla
I am not very clear about resource allocation (CPU/core/thread-level allocation) with respect to the parallelism set via the number of cores in Spark standalone mode. Any guidelines for that? -- Thanks & Regards, Anshu Shukla

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Thanks for your prompt reply. I will follow https://issues.apache.org/jira/browse/SPARK-2394 and will let you know if everything works. Cheers, Bertrand -- View this message in context:

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Hello everybody, I followed the steps from https://issues.apache.org/jira/browse/SPARK-2394 to read LZO-compressed files, but now I cannot even open a file with : lines = sc.textFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram") >>> lines.first() Traceback (most

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Robineast
Do you have LZO configured? see http://stackoverflow.com/questions/14808041/how-to-have-lzo-compression-in-hadoop-mapreduce --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co.

Intermittent performance degradation in Spark Streaming

2015-09-01 Thread Michael Siler
Hello, I'm running a small Spark Streaming instance: 4 node cluster, 1000 records per second coming in. For each record, I'm querying Cassandra, updating some very simple stats, and sending the results back to Cassandra. I'm using 10 second mini-batches, and it typically takes 8 seconds to

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-09-01 Thread Krishna Sangeeth KS
Hi Timothy, I think the driver memory in all your examples is more than what is necessary in usual cases, and the executor memory is quite low. I found this devops talk [1] from Spark Summit to be super useful in understanding a few of these configuration details. [1]

What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Hi all, I am a little bit confused, having only recently been introduced to Spark. What is going on with ML? Is it going to be deprecated? Or are all of its features valid and being built upon? It has a set of features and ML constructors which I would like to use, but I need to confirm whether the future of

Conditionally do things different on the first minibatch vs subsequent minibatches in a dstream

2015-09-01 Thread steve_ash
We have some logic that we need to apply while we are processing the events in the first minibatch only. For the second, third, etc. minibatches we don't need to do this special logic. I can't just do it as a one time thing - I need to modify a field on the events in the first minibatch. One

Re: How mature is spark sql

2015-09-01 Thread Jörn Franke
Depends on what you need to do. Can you tell more about your use cases? On Tue, Sep 1, 2015 at 13:07, rakesh sharma wrote: > Is it mature enough to use it extensively. I see that it is easier to do > than writing map/reduce in java. > We are being asked to do it

Re: Custom Partitioner

2015-09-01 Thread Davies Liu
You can take the sortByKey as example: https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642 On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker wrote: > something like... > > class RangePartitioner(Partitioner): > def __init__(self, numParts): > self.numPartitions

Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
This is master log. There's no worker registration info in the log. That means the worker may not start properly. Please check the log file with apache.spark.deploy.worker in file name. On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa wrote: > I cannot see anything

Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
It's in SPARK_HOME/logs Or you can check the spark web ui. http://[master-machine]:8080 On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa wrote: > How do I check worker logs? SPARK_HOME/work folder does not exist. I am > using the spark standalone mode. > > On 1 September

Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
No executors ? Please check the worker logs if you are using spark standalone mode. On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa wrote: > Hi All, > > I have successfully submitted some jobs to spark master. But the jobs > won't progress and not finishing. Please see the

Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
Hi All, I have successfully submitted some jobs to spark master. But the jobs won't progress and never finish. Please see the attached screenshot. These are fairly small jobs and shouldn't take more than a minute to finish. I'm new to spark and any help would be appreciated. Thanks,

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
How do I check worker logs? SPARK_HOME/work folder does not exist. I am using the spark standalone mode. On 1 September 2015 at 12:05, Jeff Zhang wrote: > No executors ? Please check the worker logs if you are using spark > standalone mode. > > On Tue, Sep 1, 2015 at 2:17 PM,

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
I cannot see anything abnormal in the logs. What could be the reason for executors not being available? On 1 September 2015 at 12:24, Madawa Soysa wrote: > Following are the logs available. Please find the attached. > > On 1 September 2015 at 12:18, Jeff Zhang

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
Following are the logs available. Please find the attached. On 1 September 2015 at 12:18, Jeff Zhang wrote: > It's in SPARK_HOME/logs > > Or you can check the spark web ui. http://[master-machine]:8080 > > > On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa

Re: Schema From parquet file

2015-09-01 Thread Cheng Lian
What exactly do you mean by "get schema from a parquet file"? - If you are trying to inspect Parquet files, parquet-tools can be pretty neat: https://github.com/Parquet/parquet-mr/issues/321 - If you are trying to get the Parquet schema in the form of a Parquet MessageType, you may resort to readFooter() and
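For the second case, a minimal sketch of reading only the footer with parquet-mr might look like the following (the file path is hypothetical, and package names vary between parquet-mr versions - older releases use parquet.hadoop.*, newer ones org.apache.parquet.hadoop.*):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import parquet.hadoop.ParquetFileReader

    // Read only the file footer; no DataFrame or Spark job is involved.
    val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/tmp/some-file.parquet"))
    val schema = footer.getFileMetaData.getSchema   // a Parquet MessageType
    println(schema)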

Re: extracting file path using dataframes

2015-09-01 Thread Jonathan Coveney
You can make a Hadoop input format which passes through the name of the file. I generally find it easier to just hit Hadoop, get the file names, and construct the RDDs, though. On Tuesday, September 1, 2015, Matt K wrote: > Just want to add - I'm looking to partition
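For the second suggestion (list the file names yourself and construct the RDDs), a rough sketch could look like this. The s3n://data/customer=<id>/ layout is taken from the original question in this thread, and the Text key/value types and the existing SparkContext sc are assumptions for illustration only:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.Text

    // List the per-customer directories via the Hadoop FileSystem API.
    val root = new Path("s3n://data/")
    val fs = FileSystem.get(root.toUri, sc.hadoopConfiguration)
    val customerDirs = fs.listStatus(root).map(_.getPath)

    // Build one RDD per directory and tag every record with the customer id from its path.
    val perCustomer = customerDirs.map { dir =>
      val customerId = dir.getName.stripPrefix("customer=")   // e.g. "123" from "customer=123"
      sc.sequenceFile(dir.toString, classOf[Text], classOf[Text])
        .map { case (k, v) => (customerId, k.toString, v.toString) }
    }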

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
Should I use DirectOutputCommitter? spark.hadoop.mapred.output.committer.class com.appsflyer.spark.DirectOutputCommitter On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov wrote: > I run spark 1.4.1 in amazom aws emr 4.0.0 > > For some reason spark saveAsTextFile is

Re: Group by specific key and save as parquet

2015-09-01 Thread Cheng Lian
Starting from Spark 1.4, you can do this via dynamic partitioning: sqlContext.table("trade").write.partitionBy("date").parquet("/tmp/path") Cheng On 9/1/15 8:27 AM, gtinside wrote: Hi , I have a set of data, I need to group by specific key and then save as parquet. Refer to the code snippet

Re: cached data between jobs

2015-09-01 Thread Jeff Zhang
Hi Eric, If the 2 jobs share the same parent stages, these stages can be skipped for the second job. Here's one simple example:

    val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
    val rdd2 = rdd1.groupByKey()
    rdd2.map(e => e._1).collect() foreach println
    rdd2.map(e => (e._1, e._2.size)).collect

Re: Conditionally do things different on the first minibatch vs subsequent minibatches in a dstream

2015-09-01 Thread Ted Yu
Can you utilize the following method in StreamingListener ? override def onBatchStarted(batchStarted: StreamingListenerBatchStarted) { Cheers On Tue, Sep 1, 2015 at 12:36 PM, steve_ash wrote: > We have some logic that we need to apply while we are processing the events
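For reference, wiring up such a listener looks roughly like this (a minimal sketch; ssc is assumed to be an existing StreamingContext, and the application would still need its own flag or timestamp check in the data path to treat the first batch specially):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchStarted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
        // Called when a batch starts; batchStarted.batchInfo.batchTime identifies the batch.
        // A driver-side flag set here on the first call could mark that the first batch has begun.
      }
    })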

Re: extracting file path using dataframes

2015-09-01 Thread Matt K
Just want to add - I'm looking to partition the resulting Parquet files by customer-id, which is why I'm looking to extract the customer-id from the path. On Tue, Sep 1, 2015 at 7:00 PM, Matt K wrote: > Hi all, > > TL;DR - is there a way to extract the source path from an

extracting file path using dataframes

2015-09-01 Thread Matt K
Hi all, TL;DR - is there a way to extract the source path from an RDD via the Scala API? I have sequence files on S3 that look something like this: s3://data/customer=123/... s3://data/customer=456/... I am using Spark Dataframes to convert these sequence files to Parquet. As part of the

Re: Hung spark executors don't count toward worker memory limit

2015-09-01 Thread hai
Hi Keith, we are running into the same issue here with Spark standalone 1.2.1. I was wondering if you have found a solution or workaround. -- View this message in context:

spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I run spark 1.4.1 in Amazon AWS EMR 4.0.0. For some reason spark saveAsTextFile is very slow on emr 4.0.0 in comparison to emr 3.8 (was 5 sec, now 95 sec). Actually saveAsTextFile says that it's done in 4.356 sec, but after that I see lots of INFO messages with 404 errors from com.amazonaws.latency

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Sean Owen
(pedantic: it's the log-probabilities) On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang wrote: > Actually > brzPi + brzTheta * testData.toBreeze > is the probabilities of the input Vector on each class, however it's a > Breeze Vector. > Pay attention the index of this Vector need

Re: Spark shell and StackOverFlowError

2015-09-01 Thread ponkin
Hi, I cannot reproduce your error on Spark 1.2.1. There is not enough information. What command line arguments are you using when starting spark-shell? What data are you reading? etc. -- View this message in context:

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
Hi, You just need to extend Partitioner and override the numPartitions and getPartition methods, see below

    class MyPartitioner extends Partitioner {
      def numPartitions: Int = ???          // Return the number of partitions
      def getPartition(key: Any): Int = ??? // Return the partition for a given key
    }

On Tue,

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
Ah sorry, I misread your question. In pyspark it looks like you just need to instantiate the Partitioner class with numPartitions and partitionFunc. On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf wrote: > Hi > > I did not get this, e.g if i need to create a custom partitioner

HiveThriftServer not registering with Zookeeper

2015-09-01 Thread sreeramvenkat
Hi, I am trying to set up dynamic service discovery for HiveThriftServer in a two-node cluster. In the thrift server logs, I do not see it registering itself with zookeeper - no znode is getting created. Pasting the relevant section from my $SPARK_HOME/conf/hive-site.xml hive.zookeeper.quorum

spark 1.5 sort slow

2015-09-01 Thread patcharee
Hi, I found spark 1.5 sorting is very slow compared to spark 1.4. Below is my code snippet val sqlRDD = sql("select date, u, v, z from fino3_hr3 where zone == 2 and z >= 2 and z <= order by date, z") println("sqlRDD " + sqlRDD.count()) The fino3_hr3 (in the sql command) is a hive

Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
Did you start the spark cluster using the command sbin/start-all.sh? You should have 2 log files under the logs folder if it is a single-node cluster. Like the following spark-jzhang-org.apache.spark.deploy.master.Master-1-jzhangMBPr.local.out

Re: bulk upload to Elasticsearch and shuffle behavior

2015-09-01 Thread Igor Berman
Hi Eric, I see that you solved your problem. Imho, when you do repartition you split your work into 2 stages, so your hbase lookup happens in the first stage and the upload to ES happens after the shuffle in the next stage; so without repartition it's hard to tell where the ES upload is and where the HBase lookup

How to determine the value for spark.sql.shuffle.partitions?

2015-09-01 Thread Romi Kuntsman
Hi all, The number of partitions greatly affects the speed and efficiency of calculation, in my case in DataFrames/SparkSQL on Spark 1.4.0. Too few partitions with large data cause OOM exceptions. Too many partitions on small data cause a delay due to overhead. How do you programmatically
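One way to set it programmatically per job is via the SQL config. A minimal sketch, assuming an existing sqlContext; the input-size estimate and the 128 MB-per-partition target below are assumptions you would tune yourself:

    // Pick a partition count from an estimated shuffle size, with the default 200 as a floor.
    val estimatedShuffleBytes = 50L * 1024 * 1024 * 1024      // e.g. ~50 GB, measured or guessed
    val targetBytesPerPartition = 128L * 1024 * 1024          // aim for ~128 MB per shuffle partition
    val numPartitions = math.max(200, (estimatedShuffleBytes / targetBytesPerPartition).toInt)
    sqlContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)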

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Yanbo Liang
Actually brzPi + brzTheta * testData.toBreeze is the probabilities of the input Vector on each class, however it's a Breeze Vector. Note that the indices of this Vector need to map to the corresponding label indices. 2015-08-28 20:38 GMT+08:00 Adamantios Corais : >
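Since those per-class scores are log values (see Sean Owen's note above), turning them into normalized probabilities means exponentiating and normalizing in log space. A sketch with Breeze; the names logPi, logTheta and x are hypothetical stand-ins for brzPi, brzTheta and testData.toBreeze:

    import breeze.linalg.{DenseMatrix, DenseVector, softmax}

    def classProbabilities(logPi: DenseVector[Double],
                           logTheta: DenseMatrix[Double],
                           x: DenseVector[Double]): DenseVector[Double] = {
      val logScores = logPi + logTheta * x        // unnormalized log-probability per class
      val logNorm = softmax(logScores)            // Breeze's softmax is log-sum-exp
      logScores.map(s => math.exp(s - logNorm))   // normalized probabilities, indexed by class
    }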

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
I used ./sbin/start-master.sh When I used ./sbin/start-all.sh the start fails. I get the following error. failed to launch org.apache.spark.deploy.master.Master: localhost: ssh: connect to host localhost port 22: Connection refused On 1 September 2015 at 13:41, Jeff Zhang

Custom Partitioner

2015-09-01 Thread shahid qadri
Hi Sparkians How can we create a custom partitioner in pyspark - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: How to effieciently write sorted neighborhood in pyspark

2015-09-01 Thread shahid qadri
> On Aug 25, 2015, at 10:43 PM, shahid qadri wrote: > > Any resources on this > >> On Aug 25, 2015, at 3:15 PM, shahid qadri wrote: >> >> I would like to implement sorted neighborhood approach in spark, what is the >> best way to write

Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
Hi, I did not get this, e.g. if I need to create a custom partitioner like a range partitioner. On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker wrote: > Hi, > > You just need to extend Partitioner and override the numPartitions and > getPartition methods, see below > > class

Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
Hi, I think the range partitioner is not available in pyspark, so if we want to create one, how should we create it? That is my question. On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker wrote: > Ah sorry I miss read your question. In pyspark it looks like you just need > to

Re: Is it possible to create spark cluster in different network?

2015-09-01 Thread Akhil Das
Did you try with SPARK_LOCAL_IP? Thanks Best Regards On Tue, Sep 1, 2015 at 12:29 AM, sakana wrote: > Hi > > I am successful create Spark cluster in openStack. > I want to create spark cluster in different openStack sites. > > In openstack, if you create instance, it only

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
something like...

    class RangePartitioner(Partitioner):
        def __init__(self, numParts):
            self.numPartitions = numParts
            self.partitionFunction = rangePartition

    def rangePartition(key):
        # Logic to turn key into a partition id
        return id

On Tue, Sep 1, 2015 at 11:38 AM shahid ashraf

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
There are no log files that include apache.spark.deploy.worker in the file name in the SPARK_HOME/logs folder. On 1 September 2015 at 13:00, Jeff Zhang wrote: > This is master log. There's no worker registration info in the log. That > means the worker may not start properly. Please

Re: Spark executor OOM issue on YARN

2015-09-01 Thread ponkin
Hi, Can you please post your stack trace with exceptions? and also command line attributes in spark-submit? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-OOM-issue-on-YARN-tp24522p24530.html Sent from the Apache Spark User List mailing list

Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
You need to make yourself able to ssh to localhost without password, please check this blog. http://hortonworks.com/kb/generating-ssh-keys-for-passwordless-login/ On Tue, Sep 1, 2015 at 4:31 PM, Madawa Soysa wrote: > I used ./sbin/start-master.sh > > When I used

How mature is spark sql

2015-09-01 Thread rakesh sharma
Is it mature enough to use extensively? I see that it is easier to use than writing map/reduce in Java. We are being asked to do it in Java itself and cannot move to Python or Scala. thanks rakesh

Schema From parquet file

2015-09-01 Thread Hafiz Mujadid
Hi all! Is there any way to get schema from a parquet file without loading into dataframe? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Schema-From-parquet-file-tp24535.html Sent from the Apache Spark User List mailing list archive at

reading multiple parquet file using spark sql

2015-09-01 Thread Hafiz Mujadid
Hi, I want to read multiple parquet files using the spark sql load method, just like we can pass multiple comma-separated paths to the sc.textFile method. Is there any way to do the same? Thanks -- View this message in context:
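For parquet specifically, the DataFrameReader should accept multiple paths in Spark 1.4+, so a minimal sketch (the paths are hypothetical, and sqlContext is assumed to exist) would be:

    // Each argument is a separate parquet path; the result is a single DataFrame.
    val df = sqlContext.read.parquet(
      "hdfs:///data/part-2015-08-01.parquet",
      "hdfs:///data/part-2015-08-02.parquet")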

spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Hi, In spark streaming 1.3 with kafka - when does the driver fetch the latest offsets for a run: at the start of each batch, or at the time the batch gets queued? Say a few of my batches take longer to complete than their batch interval, so some batches will go into the queue. Will the driver wait for queued

Spark job killed

2015-09-01 Thread Silvio Bernardinello
Hi, We are running Spark 1.4.0 on a Mesosphere cluster (~250GB memory with 16 activated hosts). Spark jobs are submitted in coarse mode. Suddenly, our jobs get killed without any error. ip-10-0-2-193.us-west-2.compute.internal, PROCESS_LOCAL, 1514 bytes) 15/09/01 10:48:24 INFO TaskSetManager:

Re: Spark job killed

2015-09-01 Thread Akhil Das
If it is not some other user then it's the kernel triggering the kill; it might be using way too much memory or swap. Check your resource usage while the job is running and see the memory overhead etc. Thanks Best Regards On Tue, Sep 1, 2015 at 5:56 PM, Silvio Bernardinello <

Re: Potential NPE while exiting spark-shell

2015-09-01 Thread nasokan
bump -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Potential-NPE-while-exiting-spark-shell-tp24523p24539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Error using spark.driver.userClassPathFirst=true

2015-09-01 Thread cgalan
Hi, When I submit a spark job in "yarn-cluster" mode with the parameter "spark.driver.userClassPathFirst", my job fails; but if I don't use this param, my job completes successfully. My environment is some nodes with CDH5.4 and Spark 1.3.0. Spark submit that fails: spark-submit

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I checked the previous emr config (emr-3.8); mapred-site.xml has the following setting: mapred.output.committer.class = org.apache.hadoop.mapred.DirectFileOutputCommitter On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov wrote: > Should I use DirectOutputCommitter? >
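If the goal is simply to restore that EMR 3.x behaviour, one hedged option is to pass the old committer through the Hadoop config when building the SparkConf (a sketch; it assumes the DirectFileOutputCommitter class quoted above is actually available on the EMR 4.x classpath):

    import org.apache.spark.SparkConf

    // spark.hadoop.* properties are copied into the Hadoop Configuration used by saveAsTextFile.
    val conf = new SparkConf()
      .set("spark.hadoop.mapred.output.committer.class",
           "org.apache.hadoop.mapred.DirectFileOutputCommitter")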

Spark + Druid

2015-09-01 Thread Harish Butani
Hi, I am working on the Spark Druid Package: https://github.com/SparklineData/spark-druid-olap. For scenarios where a 'raw event' dataset is being indexed in Druid it enables you to write your Logical Plans(queries/dataflows) against the 'raw event' dataset and it rewrites parts of the plan to

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Since in my app, after processing the events I am posting them to some external server - if the external server is down, I want to back off consuming from kafka. But I can't stop and restart the consumer since that needs manual effort. Backing off a few batches is also not possible - since the decision

Error when creating an ALS model in spark

2015-09-01 Thread Madawa Soysa
I'm getting the an error when I try to build an ALS model in spark standalone. I am new to spark. Any help would be appreciated to resolve this issue. Stack Trace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
Sounds like you'd be better off just failing if the external server is down, and scripting monitoring / restarting of your job. On Tue, Sep 1, 2015 at 11:19 AM, Shushant Arora wrote: > Since in my app , after processing the events I am posting the events to > some

cached data between jobs

2015-09-01 Thread Eric Walker
Hi, I'm noticing that a 30 minute job that was initially IO-bound may not be during subsequent runs. Is there some kind of between-job caching that happens in Spark or in Linux that outlives jobs and that might be making subsequent runs faster? If so, is there a way to avoid the caching in

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
Hi Jeff, I solved the issue by following the given instructions. Thanks for the help. Regards, Madawa. On 1 September 2015 at 14:12, Jeff Zhang wrote: > You need to make yourself able to ssh to localhost without password, > please check this blog. > >

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-01 Thread Anders Arpteg
A fix submitted less than one hour after my mail, very impressive Davies! I've compiled your PR and tested it with the large job that failed before, and it seems to work fine now without any exceptions. Awesome, thanks! Best, Anders On Tue, Sep 1, 2015 at 1:38 AM Davies Liu

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Can I reset the range based on some condition - before calling transformations on the stream? Say - before calling: directKafkaStream.foreachRDD(new Function, Void>() { @Override public Void call(JavaRDD v1) throws Exception { v1.foreachPartition(new

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
No, if you start arbitrarily messing around with offset ranges after compute is called, things are going to get out of whack. e.g. checkpoints are no longer going to correspond to what you're actually processing On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora wrote:

Re: Memory-efficient successive calls to repartition()

2015-09-01 Thread Aurélien Bellet
Dear Alexis, Thanks again for your reply. After reading about checkpointing I have modified my sample code as follows:

    for i in range(1000):
        print i
        data2 = data.repartition(50).cache()
        if (i+1) % 10 == 0:
            data2.checkpoint()
        data2.first()  # materialize rdd

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
What if I use custom checkpointing. So that I can take care of offsets being checkpointed at end of each batch. Will it be possible then to reset the offset. On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger wrote: > No, if you start arbitrarily messing around with offset

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
It's at the time compute() gets called, which should be near the time the batch should have been queued. On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora wrote: > Hi > > In spark streaming 1.3 with kafka- when does driver bring latest offsets > of this run - at start of

Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Hello everybody, I am trying to read the Google Books Ngrams with pyspark on Amazon EC2. I followed the steps from : http://spark.apache.org/docs/latest/ec2-scripts.html and everything is working fine. I am able to read the file : lines =

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
Honestly I'd concentrate more on getting your batches to finish in a timely fashion, so you won't even have the issue to begin with... On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora wrote: > What if I use custom checkpointing. So that I can take care of offsets >

Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
The web ui is at port 8080. 4040 will only show something when you have a running job or if you have configured the history server. On Sep 1, 2015 8:57 PM, "Sunil Rathee" wrote: > > Hi, > > > localhost:4040 is not showing anything on the browser. Do we have to start > some

Re: Web UI is not showing up

2015-09-01 Thread Sunil Rathee
localhost:8080 is also not showing anything. Is some application running at the same time? On Tue, Sep 1, 2015 at 9:04 PM, Sonal Goyal wrote: > The web ui is at port 8080. 4040 will show up something when you have a > running job or if you have configured history

Re: Is it possible to create spark cluster in different network?

2015-09-01 Thread Max Huang
Thanks for replying to my topic. Yes, I tried SPARK_MASTER_IP and SPARK_LOCAL_IP in spark-env.sh ^^ 2015-09-01 5:47 GMT-05:00 Akhil Das : > Did you try with SPARK_LOCAL_IP? > > Thanks > Best Regards > > On Tue, Sep 1, 2015 at 12:29 AM, sakana wrote: >

Web UI is not showing up

2015-09-01 Thread Sunil Rathee
Hi, localhost:4040 is not showing anything on the browser. Do we have to start some service? -- Sunil Rathee

Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
Is your master up? Check the java processes to see if they are running. Best Regards, Sonal Founder, Nube Technologies Reifier covered in YourStory Reifier at Spark Summit 2015

What should be the optimal value for spark.sql.shuffle.partition?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL, actually hiveContext.sql(), which uses group by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default 200 to 1000, but it is not helping. Please correct me if I am wrong: these partitions will share the data shuffle