Re: Executor lost failure

2015-09-01 Thread Andrew Duffy
If you're using YARN with Spark 1.3.1, you could be running into
https://issues.apache.org/jira/browse/SPARK-8119, although without more
information it's impossible to know.

On Tue, Sep 1, 2015 at 11:28 AM, Priya Ch 
wrote:

> Hi All,
>
>  I have a Spark Streaming application which writes the processed results
> to Cassandra. In local mode, the code seems to work fine. The moment I
> start running in distributed mode using YARN, I see executor lost failures.
> I increased executor memory to occupy the entire node's memory, which is around
> 12gb, but I still see the same issue.
>
> What could be the possible scenarios for executor lost failure ?
>


Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
I feel the need for pause and resume in a streaming app :)

Is there any limit on the max number of queued jobs? If yes, what happens once that
limit is reached? Does the job get killed?


On Tue, Sep 1, 2015 at 10:02 PM, Cody Koeninger  wrote:

> Sounds like you'd be better off just failing if the external server is
> down, and scripting monitoring / restarting of your job.
>
> On Tue, Sep 1, 2015 at 11:19 AM, Shushant Arora  > wrote:
>
>> Since in my app, after processing the events I am posting them to
>> some external server - if the external server is down, I want to back off
>> consuming from Kafka. But I can't stop and restart the consumer, since that
>> needs manual effort.
>>
>> Backing off a few batches is also not possible - since the backoff decision
>> is based on the last batch's processing status, but the driver has already computed
>> offsets for the next batches - so if I skip the next few batches until the external
>> server is back to normal, it's data loss if I cannot reset the offsets.
>>
>> So the only option seems to be to delay the current batch by calling sleep() in
>> the foreachRDD method after returning from the foreachPartition transformations.
>>
>> The concern here is that further batches will keep enqueueing until the current
>> slept batch completes. So what's the max size of the scheduling queue? Say the
>> server does not come up for hours and my batch interval is 5 sec - it will
>> enqueue 720 batches per hour.
>> Will that be an issue?
>>  And is there any setting in the direct Kafka stream to avoid calling
>> further compute() methods past a threshold of scheduling queue size, say
>> 50 batches, and once the queue size drops back below the threshold, call compute
>> and enqueue the next job?
>>
>>
>>
>>
>>
>> On Tue, Sep 1, 2015 at 8:57 PM, Cody Koeninger 
>> wrote:
>>
>>> Honestly I'd concentrate more on getting your batches to finish in a
>>> timely fashion, so you won't even have the issue to begin with...
>>>
>>> On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
 What if I use custom checkpointing, so that I can take care of offsets
 being checkpointed at the end of each batch?

 Would it then be possible to reset the offsets?

 On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger 
 wrote:

> No, if you start arbitrarily messing around with offset ranges after
> compute is called, things are going to get out of whack.
>
> e.g. checkpoints are no longer going to correspond to what you're
> actually processing
>
> On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> can I reset the range based on some condition - before calling
>> transformations on the stream.
>>
>> Say -
>> before calling :
>>  // (generic type parameters were lost in the archive; String is used below purely as a placeholder)
>>  directKafkaStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>      @Override
>>      public Void call(JavaRDD<String> v1) throws Exception {
>>          v1.foreachPartition(new VoidFunction<Iterator<String>>() {
>>              @Override
>>              public void call(Iterator<String> t) throws Exception {
>>              }
>>          });
>>          return null;
>>      }
>>  });
>>
>> change directKafkaStream's RDD's offset range.(fromOffset).
>>
>> I can't do this in the compute method, since compute would have been
>> called at the current batch's queue time - but the condition is set at the
>> previous batch's run time.
>>
>>
>> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger 
>> wrote:
>>
>>> It's at the time compute() gets called, which should be near the
>>> time the batch should have been queued.
>>>
>>> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
 Hi

 In Spark Streaming 1.3 with Kafka - when does the driver fetch the latest
 offsets for a run: at the start of each batch, or at the time the batch
 gets queued?

 Say a few of my batches take longer to complete than their batch
 interval, so some batches will go into the queue. Will the driver wait for
 queued batches to get started, or does it fetch the latest offsets before
 they have even actually started, so that when they start running they will
 work on the old offsets fetched at the time they were queued?


>>>
>>
>

>>>
>>
>
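
For reference, a minimal sketch of the sleep-based backoff Shushant describes (not Cody's recommendation, which is to fail and script a restart). It assumes a Spark 1.3-era direct Kafka stream; the broker, topic, and the isServerUp() health check are all made up for illustration:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))   // assumes an existing SparkContext `sc`

// Hypothetical health check for the external server, evaluated on the driver only.
def isServerUp(): Boolean =
  scala.util.Try(new java.net.Socket("external-server", 8080).close()).isSuccess

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // illustrative broker list
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))                                // illustrative topic

stream.foreachRDD { rdd =>
  // This body runs on the driver once per batch. The batch's offset range was fixed
  // when it was queued, so sleeping here loses nothing, but later batches will keep
  // piling up in the scheduling queue until the server comes back.
  while (!isServerUp()) Thread.sleep(30000)
  rdd.foreachPartition { partition =>
    partition.foreach { case (_, value) =>
      // POST `value` to the external server here (hypothetical sink)
      ()
    }
  }
}

ssc.start()
ssc.awaitTermination()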


Re: How to avoid shuffle errors for a large join ?

2015-09-01 Thread Thomas Dudziak
While it works with sort-merge-join, it takes about 12h to finish (with
1 shuffle partitions). My hunch is that the reason for that is this:

INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB to disk
(62 times so far)

(and lots more where this comes from).

On Sat, Aug 29, 2015 at 7:17 PM, Reynold Xin  wrote:

> Can you try 1.5? This should work much, much better in 1.5 out of the box.
>
> For 1.4, I think you'd want to turn on sort-merge-join, which is off by
> default. However, the sort-merge join in 1.4 can still trigger a lot of
> garbage, making it slower. SMJ performance is probably 5x - 1000x better in
> 1.5 for your case.
>
>
> On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak  wrote:
>
>> I'm getting errors like "Removing executor with no recent heartbeats" &
>> "Missing an output location for shuffle" errors for a large SparkSql join
>> (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to
>> configure the job to avoid them.
>>
>> The initial stage completes fine with some 30k tasks on a cluster with 70
>> machines/10TB memory, generating about 6.5TB of shuffle writes, but then
>> the shuffle stage first waits 30min in the scheduling phase according to
>> the UI, and then dies with the mentioned errors.
>>
>> I can see in the GC logs that the executors reach their memory limits
>> (32g per executor, 2 workers per machine) and can't allocate any more stuff
>> in the heap. Fwiw, the top 10 in the memory use histogram are:
>>
>>  num     #instances         #bytes  class name
>> ----------------------------------------------
>>    1:     249139595    11958700560  scala.collection.immutable.HashMap$HashMap1
>>    2:     251085327     8034730464  scala.Tuple2
>>    3:     243694737     5848673688  java.lang.Float
>>    4:     231198778     5548770672  java.lang.Integer
>>    5:      72191585     4298521576  [Lscala.collection.immutable.HashMap;
>>    6:      72191582     2310130624  scala.collection.immutable.HashMap$HashTrieMap
>>    7:      74114058     1778737392  java.lang.Long
>>    8:       6059103      779203840  [Ljava.lang.Object;
>>    9:       5461096      174755072  scala.collection.mutable.ArrayBuffer
>>   10:         34749       70122104  [B
>>
>> Relevant settings are (Spark 1.4.1, Java 8 with G1 GC):
>>
>> spark.core.connection.ack.wait.timeout 600
>> spark.executor.heartbeatInterval   60s
>> spark.executor.memory  32g
>> spark.mesos.coarse false
>> spark.network.timeout  600s
>> spark.shuffle.blockTransferService netty
>> spark.shuffle.consolidateFiles true
>> spark.shuffle.file.buffer  1m
>> spark.shuffle.io.maxRetries6
>> spark.shuffle.manager  sort
>>
>> The join is currently configured with spark.sql.shuffle.partitions=1000
>> but that doesn't seem to help. Would increasing the partitions help ? Is
>> there a formula to determine an approximate partitions number value for a
>> join ?
>> Any help with this job would be appreciated !
>>
>> cheers,
>> Tom
>>
>
>
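
For reference, a sketch of the 1.4 knobs Reynold mentions; the property names are the 1.4-era SQL configs as best I can tell (sort-merge join is governed by spark.sql.planner.sortMergeJoin), and the partition count is purely illustrative, not a tuned recommendation:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Turn on sort-merge join (off by default in 1.4) and widen the shuffle so each
// reducer handles a smaller slice of the multi-TB shuffle data.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
sqlContext.setConf("spark.sql.shuffle.partitions", "4000")

// The join itself is unchanged; table and column names are placeholders.
val joined = sqlContext.sql(
  "SELECT a.*, b.value FROM big_table a JOIN small_table b ON a.key = b.key")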


Executor lost failure

2015-09-01 Thread Priya Ch
Hi All,

 I have a Spark Streaming application which writes the processed results to
Cassandra. In local mode, the code seems to work fine. The moment I start
running in distributed mode using YARN, I see executor lost failures. I
increased executor memory to occupy the entire node's memory, which is around
12gb, but I still see the same issue.

What could be the possible scenarios for executor lost failure ?


RE: What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Thank you, so I was confused in the opposite direction. At first I thought MLlib was the
future, but based on what you say, MLlib will be the past. Interesting.
This means that if I go forward using the pipelines system, I shouldn't
become obsolete.

Any more insights welcome,
Saif

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Tuesday, September 01, 2015 3:31 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: What is the current status of ML ?

I think the story is that the new spark.ml "pipelines" API is the future. Most 
(all?) of the spark.mllib functionality has been ported over and/or translated. 
I don't know that spark.mllib will actually be deprecated soon -- not until 
spark.ml is fully blessed as 'stable' I'd imagine, at least. Even if it were I 
don't think it would go away. You can use spark.mllib now as it is pretty 
stable with some confidence, and look to spark.ml if you're interested in the 
"2.0" of MLlib and are willing to work with APIs that may change a bit.

On Tue, Sep 1, 2015 at 7:23 PM,   wrote:
> Hi all,
>
> I am a little bit confused, having been introduced to Spark only recently.
> What is going on with ML? Is it going to be deprecated? Or are all of
> its features valid and built upon? It has a set of features and
> ML constructors which I would like to use, but I need to confirm whether
> these functionalities will remain valid in the future.
> I am reading different opinions on this here and there, including on the
> official site that all contributions should go to MLlib, and even that
> ML uses MLlib already.
>
> Saif
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
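
For anyone landing on this thread, a small spark.ml "pipelines" sketch of the 1.4-era API Sean describes (assuming a SQLContext named sqlContext; the data and column names are made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training set: (id, text, label)
val training = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Two transformers and an estimator chained into a single Pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting produces a PipelineModel that can be reused on new DataFrames.
val model = pipeline.fit(training)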



Re: Reading xml in java using spark

2015-09-01 Thread Darin McBeath
Another option might be to leverage spark-xml-utils 
(https://github.com/dmcbeath/spark-xml-utils)

This is a collection of xml utilities that I've recently revamped that make it 
relatively easy to use xpath, xslt, or xquery within the context of a Spark 
application (or at least I think so).  My previous attempt was not overly 
friendly, but as I've learned more about Spark (and needed easier to use xml 
utilities) I've hopefully made this easier to use and understand.  I hope 
others find it useful.

Back to your problem.  Assuming you have a bunch of xml records in an RDD, you 
should be able to do something like the following to count the number of 
elements for that type.  In the example below, I'm counting the number of 
references in documents.  The xmlKeyPair is an RDD of type (String,String) 
where the first item is the 'key' and the second item is the xml record.  The 
xpath expression identifies the 'reference' element I want to count.

import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap

xmlKeyPair.mapPartitions(recsIter => {
  val xpath = "count(/xocs:doc/xocs:meta/xocs:references/xocs:ref-info)"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.map(rec => proc.evaluateString(rec._2).toInt)
}).sum


There is more documentation on the spark-xml-utils github site.  Let me know if 
the documentation is not clear or if you have any questions. 

Darin.



From: Rick Hillegas 
To: Sonal Goyal  
Cc: rakesh sharma ; user@spark.apache.org 
Sent: Monday, August 31, 2015 10:51 AM
Subject: Re: Reading xml in java using spark



Hi Rakesh,

You might also take a look at the Derby code.
   org.apache.derby.vti.XmlVTI provides a number of static methods for
   turning an XML resource into a JDBC ResultSet.

Thanks,
-Rick

On 8/31/15 4:44 AM, Sonal Goyal wrote: 


I think the mahout project had an XmlInputFormat which you can leverage.
>On Aug 31, 2015 5:10 PM, "rakesh sharma"  wrote:
>
>I want to parse an xml file in spark.
>>But as far as the example is concerned, it reads it as a text file. The mapping to
>>xml will be a tedious job.
>>How can I find the number of elements of a particular type using that? Any
>>help in java/scala code is also welcome.
>>
>>
>>thanks
>>rakesh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What is the current status of ML ?

2015-09-01 Thread Sean Owen
I think the story is that the new spark.ml "pipelines" API is the
future. Most (all?) of the spark.mllib functionality has been ported
over and/or translated. I don't know that spark.mllib will actually be
deprecated soon -- not until spark.ml is fully blessed as 'stable' I'd
imagine, at least. Even if it were I don't think it would go away. You
can use spark.mllib now as it is pretty stable with some confidence,
and look to spark.ml if you're interested in the "2.0" of MLlib and
are willing to work with APIs that may change a bit.

On Tue, Sep 1, 2015 at 7:23 PM,   wrote:
> Hi all,
>
> I am a little bit confused, having been introduced to Spark only recently. What is
> going on with ML? Is it going to be deprecated? Or are all of its features
> valid and built upon? It has a set of features and ML constructors
> which I would like to use, but I need to confirm whether these
> functionalities will remain valid in the future.
> I am reading different opinions on this here and there, including on the official
> site that all contributions should go to MLlib, and even that ML uses MLlib
> already.
>
> Saif
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Resource allocation in SPARK streaming

2015-09-01 Thread anshu shukla
I am not very clear about resource allocation (CPU/core/thread-level
allocation) with respect to the parallelism set by the number of cores in Spark
standalone mode.

Any guidelines for that?

-- 
Thanks & Regards,
Anshu Shukla
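
As a rough sketch of the standalone-mode knobs involved (values are illustrative): spark.cores.max caps the total cores the application takes from the cluster, spark.executor.memory sets memory per executor, and each receiver-based input stream permanently occupies one of those cores, so leave some free for processing tasks.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
  .setAppName("StreamingResourceSketch")
  .set("spark.cores.max", "8")             // total cores for this app across all workers
  .set("spark.executor.memory", "4g")      // memory per executor

// With one receiver, 1 of the 8 cores is tied up receiving; the remaining 7
// are available to run the tasks of each micro-batch.
val ssc = new StreamingContext(conf, Seconds(5))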


Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Thanks for your prompt reply.

I will follow https://issues.apache.org/jira/browse/SPARK-2394 and will let
you know if everything works.

Cheers,

Bertrand



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24545.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Hello everybody,

I followed the steps from https://issues.apache.org/jira/browse/SPARK-2394
to read LZO-compressed files, but now I cannot even open a file with :

lines =
sc.textFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")


>>> lines.first()
Traceback (most recent call last):
  File "", line 1, in 
  File "/root/spark/python/pyspark/rdd.py", line 1295, in first
rs = self.take(1)
  File "/root/spark/python/pyspark/rdd.py", line 1247, in take
totalParts = self.getNumPartitions()
  File "/root/spark/python/pyspark/rdd.py", line 355, in getNumPartitions
return self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.lang.RuntimeException: Error in configuring object




lines =
sc.sequenceFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")

Traceback (most recent call last):
  File "", line 1, in 
  File "/root/spark/python/pyspark/context.py", line 544, in sequenceFile
keyConverter, valueConverter, minSplits, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, 172.31.12.23): java.lang.IllegalArgumentException: Unknown codec:
com.hadoop.compression.lzo.LzoCodec





Thanks for your help,

Cheers,

Bertrand




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24546.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Robineast
Do you have LZO configured? see
http://stackoverflow.com/questions/14808041/how-to-have-lzo-compression-in-hadoop-mapreduce

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/malak/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542p24544.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
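
Assuming the hadoop-lzo jar and its native libraries are installed on every node (per the StackOverflow link above), the remaining step is usually just registering the codec classes with the Hadoop configuration Spark uses. A sketch in Scala follows; the property names are standard Hadoop keys, so the same values apply from PySpark as well, and the codec class names come from the hadoop-lzo project:

// Only tells Hadoop which codec classes to load; the jar and libgplcompression
// still have to be present on the driver and executors.
sc.hadoopConfiguration.set("io.compression.codecs",
  "org.apache.hadoop.io.compress.DefaultCodec," +
  "org.apache.hadoop.io.compress.GzipCodec," +
  "com.hadoop.compression.lzo.LzoCodec," +
  "com.hadoop.compression.lzo.LzopCodec")
sc.hadoopConfiguration.set("io.compression.codec.lzo.class",
  "com.hadoop.compression.lzo.LzoCodec")
// After this, retry the textFile/sequenceFile read from the original message.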



Intermittent performance degradation in Spark Streaming

2015-09-01 Thread Michael Siler
Hello,

I'm running a small Spark Streaming instance: 4 node cluster, 1000
records per second coming in. For each record, I'm querying Cassandra,
updating some very simple stats, and sending the results back to
Cassandra. I'm using 10 second mini-batches, and it typically takes 8
seconds to process them. However, every so often a mini-batch will
take significantly longer to process -- generally between 30 seconds
and 2+ minutes (I'm rate-limiting my input, so the mini-batch size
isn't changing). It might take 5 minutes to come across one of these
strange batches or 45 minutes, but they always happen.

I've been trying to debug this problem for a while now, but I'm
getting stuck. I've turned on Spark's debug logging and looked through
the logs during one of these slowdowns. I found that 3 of my 4
executor nodes were completely silent in the logs -- the timestamp
just jumped 30 seconds between consecutive lines. Before the silence,
every task assigned to those nodes finished and no new ones were
assigned. Then at the end of that 30 seconds, the logs show them
receiving an akka message (shown below), then they start running tasks
again and things go back to normal.


10:34:54,072 DEBUG -- sparkExecutor-akka.actor.default-dispatcher-6
akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1 - [actor]
received message
AkkaMessage(LaunchTask(org.apache.spark.util.SerializableBuffer@700b204a),false)
from Actor[akka://sparkExecutor/deadLetters]
10:34:54,072 DEBUG -- sparkExecutor-akka.actor.default-dispatcher-6
akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1 - Received
RPC message: 
AkkaMessage(LaunchTask(org.apache.spark.util.SerializableBuffer@700b204a),false)
10:34:54,073  INFO -- sparkExecutor-akka.actor.default-dispatcher-6
executor.CoarseGrainedExecutorBackend - Got assigned task 151600
10:34:54,073 DEBUG -- sparkExecutor-akka.actor.default-dispatcher-6
akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1 - [actor]
handled message (0.215506 ms)
AkkaMessage(LaunchTask(org.apache.spark.util.SerializableBuffer@700b204a),false)
from Actor[akka://sparkExecutor/deadLetters]
10:34:54,073 DEBUG -- sparkExecutor-akka.actor.default-dispatcher-6
akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1 - [actor]
received message
AkkaMessage(LaunchTask(org.apache.spark.util.SerializableBuffer@10793472),false)
from Actor[akka://sparkExecutor/deadLetters]
10:34:54,073 DEBUG -- sparkExecutor-akka.actor.default-dispatcher-6
akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1 - Received
RPC message: 
AkkaMessage(LaunchTask(org.apache.spark.util.SerializableBuffer@10793472),false)
10:34:54,073  INFO -- Executor task launch worker-1
executor.Executor - Running task 2.0 in stage 16466.0 (TID 151600)



As I mentioned, that is on 3 of the 4 executors. On the driver, during
that 30-second period, its logs only show it taking in input, making
blocks for it, and sending those blocks to the 4th executor. That 4th
executor's logs during that period show it doing no tasks and instead
only receiving those blocks. At the end of that time period, the
driver and executor suddenly start doing work again just like the
other executors.

Can anyone give me any ideas (even vague ones!) of things to check to
figure out what's happening here?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-09-01 Thread Krishna Sangeeth KS
Hi Timothy,

I think the driver memory in all your examples is more than what is
necessary in usual cases, and the executor memory is quite low.

I found this devops talk [1] at Spark Summit to be super useful for
understanding a few of these configuration details.

[1] https://www.youtube.com/watch?v=l4ZYUfZuRbU

Cheers,
Sangeeth
On Aug 30, 2015 7:28 AM, "timothy22000"  wrote:

> I am doing some memory tuning on my Spark job on YARN and I notice
> different
> settings would give different results and affect the outcome of the Spark
> job run. However, I am confused and do not understand completely why it
> happens and would appreciate if someone can provide me with some guidance
> and explanation.
>
> I will provide some background information and describe the cases that I
> have experienced and post my questions after them below.
>
> *My environment setting were as below:*
>
>  - Memory 20G, 20 VCores per node (3 nodes in total)
>  - Hadoop 2.6.0
>  - Spark 1.4.0
>
> My code recursively filters an RDD to make it smaller (removing examples as
> part of an algorithm), then does mapToPair and collect to gather the
> results
> and save them within a list.
>
>  First Case
>
> `/bin/spark-submit --class  --master yarn-cluster
> --driver-memory 7g --executor-memory 1g --num-executors 3 --executor-cores 1
> --jars `
> If I run my program with any driver memory less than 11g, I will get the
> error below which is the SparkContext being stopped or a similar error
> which
> is a method being called on a stopped SparkContext. From what I have
> gathered, this is related to memory not being enough.
>
>
>  >
>
> Second Case
>
>
> `/bin/spark-submit --class  --master yarn-cluster
> --driver-memory 7g --executor-memory 3g --num-executors 3 --executor-cores 1
> --jars `
>
> If I run the program with the same driver memory but higher executor
> memory,
> the job runs longer (about 3-4 minutes) than the first case and then it
> will
> encounter a different error from earlier which is a Container
> requesting/using more memory than allowed and is being killed because of
> that. Although I find it weird since the executor memory is increased and
> this error occurs instead of the error in the first case.
>
>  >
>
> Third Case
>
>
> `/bin/spark-submit --class  --master yarn-cluster
> --driver-memory 11g --executor-memory 1g --num-executors 3 --executor-cores 1
> --jars `
>
> Any setting with driver memory greater than 10g will lead to the job being
> able to run successfully.
>
> Fourth Case
>
>
> `/bin/spark-submit --class  --master yarn-cluster
> --driver-memory 2g --executor-memory 1g --conf
> spark.yarn.executor.memoryOverhead=1024 --conf
> spark.yarn.driver.memoryOverhead=1024 --num-executors 3 --executor-cores 1
> --jars `
> The job will run successfully with this setting (driver memory 2g and
> executor memory 1g but increasing the driver memory overhead(1g) and the
> executor memory overhead(1g).
>
> Questions
>
>
>  1. Why is a different error thrown, and why does the job run longer, in the second
> case compared to the first case, when only the executor memory was
> increased? Are the two errors linked in some way?
>
>  2. Both the third and fourth case succeeds and I understand that it is
> because I am giving more memory which solves the memory problems. However,
> in the third case,
>
> spark.driver.memory + spark.yarn.driver.memoryOverhead = the memory with which
> YARN will create a JVM
> = 11g + (driverMemory * 0.07, with minimum of 384m)
> = 11g + 1.154g
> = 12.154g
>
> So, from the formula, I can see that my job requires MEMORY_TOTAL of around
> 12.154g to run successfully which explains why I need more than 10g for the
> driver memory setting.
>
> But for the fourth case,
>
> spark.driver.memory + spark.yarn.driver.memoryOverhead = the memory with which
> YARN will create a JVM
> = 2g + (driverMemory * 0.07, with minimum of 384m)
> = 2g + 0.524g
> = 2.524g
>
> It seems that just increasing the memory overhead by a small amount of
> 1024 (1g) leads to a successful run of the job with a driver memory of
> only 2g, and the MEMORY_TOTAL is only 2.524g! Whereas without the overhead
> configuration, any driver memory less than 11g fails, which doesn't make sense
> from the formula - which is why I am confused.
>
> Why does increasing the memory overhead (for both driver and executor) allow my
> job to complete successfully with a lower MEMORY_TOTAL (12.154g vs 2.524g)?
> Is there some other internal mechanism at work here that I am missing?
>
> I would really appreciate any help offered, as it would really help with my
> understanding of Spark. Thanks in advance.
>
>
>
> --
> View this message in context:
> 

What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Hi all,

I am a little bit confused, having been introduced to Spark only recently. What is
going on with ML? Is it going to be deprecated? Or are all of its features
valid and built upon? It has a set of features and ML constructors which
I would like to use, but I need to confirm whether these functionalities
will remain valid in the future.
I am reading different opinions on this here and there, including on the official site
that all contributions should go to MLlib, and even that ML uses MLlib already.

Saif



Conditionally do things different on the first minibatch vs subsequent minibatches in a dstream

2015-09-01 Thread steve_ash
We have some logic that we need to apply while we are processing the events
in the first minibatch only.  For the second, third, etc. minibatches we
don't need to do this special logic.  I can't just do it as a one time thing
- I need to modify a field on the events in the first minibatch.

One approach:
1- create inp = inputDStream
2- call inp.foreachRdd(rdd -> rdd.foreachPartition( isFirst = true ) )
3- do m1 = inp.map( if (isFirst) setStuffOnEvent )
4- do m2 = m1.mapPartition( isFirst = false )
5- do m2.foreachRdd ( rest of my stuff)

Does the foreachRdd in (2) just happen once, right at the beginning, since it
is an action? Or does it happen for every minibatch just like the
foreachRdd() in (5) does?

What is another way we could accomplish this need of setting something on the
events in the first minibatch and not the others?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Conditionally-do-things-different-on-the-first-minibatch-vs-subsequent-minibatches-in-a-dstream-tp24547.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How mature is spark sql

2015-09-01 Thread Jörn Franke
Depends on what you need to do. Can you tell more about your use cases?

Le mar. 1 sept. 2015 à 13:07, rakesh sharma  a
écrit :

> Is it mature enough to use extensively? I see that it is easier to use
> than writing map/reduce in Java.
> We are being asked to do it in Java itself and cannot move to Python or
> Scala.
>
> thanks
> rakesh
>


Re: Custom Partitioner

2015-09-01 Thread Davies Liu
You can take the sortByKey as example:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642

On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker  wrote:
> something like...
>
> class RangePartitioner(Partitioner):
>     def __init__(self, numParts):
>         self.numPartitions = numParts
>         self.partitionFunction = self.rangePartition
>     def rangePartition(self, key):
>         # Logic to turn the key into a partition id
>         return id
>
> On Tue, Sep 1, 2015 at 11:38 AM shahid ashraf  wrote:
>>
>> Hi
>>
>> I think a range partitioner is not available in pyspark, so if we want to
>> create one, how should we create it? That is my question.
>>
>> On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker  wrote:
>>>
>>> Ah sorry, I misread your question. In pyspark it looks like you just
>>> need to instantiate the Partitioner class with numPartitions and
>>> partitionFunc.
>>>
>>> On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf  wrote:

 Hi

 I did not get this, e.g. what if I need to create a custom partitioner like a
 range partitioner?

 On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:
>
> Hi,
>
> You just need to extend Partitioner and override the numPartitions and
> getPartition methods, see below
>
> class MyPartitioner extends Partitioner {
>   override def numPartitions: Int = ???          // Return the number of partitions
>   override def getPartition(key: Any): Int = ??? // Return the partition for a given key
> }
>
> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
> wrote:
>>
>> Hi Sparkians
>>
>> How can we create a customer partition in pyspark
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>



 --
 with Regards
 Shahid Ashraf
>>
>>
>>
>>
>> --
>> with Regards
>> Shahid Ashraf

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
This is the master log. There's no worker registration info in the log. That
means the worker may not have started properly. Please check the log file
with apache.spark.deploy.worker in its file name.



On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa 
wrote:

> I cannot see anything abnormal in the logs. What would be the reason for the
> unavailability of executors?
>
> On 1 September 2015 at 12:24, Madawa Soysa 
> wrote:
>
>> Following are the logs available. Please find the attached.
>>
>> On 1 September 2015 at 12:18, Jeff Zhang  wrote:
>>
>>> It's in SPARK_HOME/logs
>>>
>>> Or you can check the spark web ui. http://[master-machine]:8080
>>>
>>>
>>> On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa 
>>> wrote:
>>>
 How do I check worker logs? SPARK_HOME/work folder does not exist. I am
 using the spark standalone mode.

 On 1 September 2015 at 12:05, Jeff Zhang  wrote:

> No executors ? Please check the worker logs if you are using spark
> standalone mode.
>
> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa 
> wrote:
>
>> Hi All,
>>
>> I have successfully submitted some jobs to spark master. But the jobs
>> won't progress and not finishing. Please see the attached screenshot. 
>> These
>> are fairly very small jobs and this shouldn't take more than a minute to
>> finish.
>>
>> I'm new to spark and any help would be appreciated.
>>
>> Thanks,
>> Madawa.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



 --

 *_**Madawa Soysa*

 Undergraduate,

 Department of Computer Science and Engineering,

 University of Moratuwa.


 Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
 madawa...@cse.mrt.ac.lk
 LinkedIn  | Twitter
  | Tumblr 

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of Moratuwa.
>>
>>
>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>> madawa...@cse.mrt.ac.lk
>> LinkedIn  | Twitter
>>  | Tumblr 
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



-- 
Best Regards

Jeff Zhang


Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
It's in SPARK_HOME/logs

Or you can check the spark web ui. http://[master-machine]:8080


On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa 
wrote:

> How do I check worker logs? SPARK_HOME/work folder does not exist. I am
> using the spark standalone mode.
>
> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>
>> No executors ? Please check the worker logs if you are using spark
>> standalone mode.
>>
>> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have successfully submitted some jobs to spark master. But the jobs
>>> won't progress and not finishing. Please see the attached screenshot. These
>>> are fairly very small jobs and this shouldn't take more than a minute to
>>> finish.
>>>
>>> I'm new to spark and any help would be appreciated.
>>>
>>> Thanks,
>>> Madawa.
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



-- 
Best Regards

Jeff Zhang


Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
No executors ? Please check the worker logs if you are using spark
standalone mode.

On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa 
wrote:

> Hi All,
>
> I have successfully submitted some jobs to spark master. But the jobs
> won't progress and not finishing. Please see the attached screenshot. These
> are fairly very small jobs and this shouldn't take more than a minute to
> finish.
>
> I'm new to spark and any help would be appreciated.
>
> Thanks,
> Madawa.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Best Regards

Jeff Zhang


Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
Hi All,

I have successfully submitted some jobs to the Spark master. But the jobs won't
progress and are not finishing. Please see the attached screenshot. These are
fairly small jobs and they shouldn't take more than a minute to finish.

I'm new to spark and any help would be appreciated.

Thanks,
Madawa.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
How do I check worker logs? SPARK_HOME/work folder does not exist. I am
using the spark standalone mode.

On 1 September 2015 at 12:05, Jeff Zhang  wrote:

> No executors ? Please check the worker logs if you are using spark
> standalone mode.
>
> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa 
> wrote:
>
>> Hi All,
>>
>> I have successfully submitted some jobs to spark master. But the jobs
>> won't progress and not finishing. Please see the attached screenshot. These
>> are fairly very small jobs and this shouldn't take more than a minute to
>> finish.
>>
>> I'm new to spark and any help would be appreciated.
>>
>> Thanks,
>> Madawa.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 

*_**Madawa Soysa*

Undergraduate,

Department of Computer Science and Engineering,

University of Moratuwa.


Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
madawa...@cse.mrt.ac.lk
LinkedIn  | Twitter
 | Tumblr 


Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
I cannot see anything abnormal in the logs. What would be the reason for the
unavailability of executors?

On 1 September 2015 at 12:24, Madawa Soysa  wrote:

> Following are the logs available. Please find the attached.
>
> On 1 September 2015 at 12:18, Jeff Zhang  wrote:
>
>> It's in SPARK_HOME/logs
>>
>> Or you can check the spark web ui. http://[master-machine]:8080
>>
>>
>> On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa 
>> wrote:
>>
>>> How do I check worker logs? SPARK_HOME/work folder does not exist. I am
>>> using the spark standalone mode.
>>>
>>> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>>>
 No executors ? Please check the worker logs if you are using spark
 standalone mode.

 On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa 
 wrote:

> Hi All,
>
> I have successfully submitted some jobs to spark master. But the jobs
> won't progress and not finishing. Please see the attached screenshot. 
> These
> are fairly very small jobs and this shouldn't take more than a minute to
> finish.
>
> I'm new to spark and any help would be appreciated.
>
> Thanks,
> Madawa.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>
>>> *_**Madawa Soysa*
>>>
>>> Undergraduate,
>>>
>>> Department of Computer Science and Engineering,
>>>
>>> University of Moratuwa.
>>>
>>>
>>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>>> madawa...@cse.mrt.ac.lk
>>> LinkedIn  | Twitter
>>>  | Tumblr 
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



-- 

*_**Madawa Soysa*

Undergraduate,

Department of Computer Science and Engineering,

University of Moratuwa.


Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
madawa...@cse.mrt.ac.lk
LinkedIn  | Twitter
 | Tumblr 


Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
Following are the logs available. Please find the attached.

On 1 September 2015 at 12:18, Jeff Zhang  wrote:

> It's in SPARK_HOME/logs
>
> Or you can check the spark web ui. http://[master-machine]:8080
>
>
> On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa 
> wrote:
>
>> How do I check worker logs? SPARK_HOME/work folder does not exist. I am
>> using the spark standalone mode.
>>
>> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>>
>>> No executors ? Please check the worker logs if you are using spark
>>> standalone mode.
>>>
>>> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa 
>>> wrote:
>>>
 Hi All,

 I have successfully submitted some jobs to spark master. But the jobs
 won't progress and not finishing. Please see the attached screenshot. These
 are fairly very small jobs and this shouldn't take more than a minute to
 finish.

 I'm new to spark and any help would be appreciated.

 Thanks,
 Madawa.


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of Moratuwa.
>>
>>
>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>> madawa...@cse.mrt.ac.lk
>> LinkedIn  | Twitter
>>  | Tumblr 
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 

*_**Madawa Soysa*

Undergraduate,

Department of Computer Science and Engineering,

University of Moratuwa.


Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
madawa...@cse.mrt.ac.lk
LinkedIn  | Twitter
 | Tumblr 
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
/opt/spark-1.4.1-bin-hadoop2.6/ml/org.wso2.carbon.ml.core_1.0.1.SNAPSHOT.jar:/opt/spark-1.4.1-bin-hadoop2.6/ml/org.wso2.carbon.ml.commons_1.0.1.SNAPSHOT.jar:/opt/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/opt/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
 -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master 
--ip [master-url] --port 7077 --webui-port 8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/01 11:32:14 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/09/01 11:32:14 WARN Utils: Your hostname, ubuntu resolves to a loopback 
address: 127.0.1.1; using 10.8.108.92 instead (on interface wlan0)
15/09/01 11:32:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
15/09/01 11:32:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/09/01 11:32:15 INFO SecurityManager: Changing view acls to: ubuntu
15/09/01 11:32:15 INFO SecurityManager: Changing modify acls to: ubuntu
15/09/01 11:32:15 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(ubuntu); users 
with modify permissions: Set(ubuntu)
15/09/01 11:32:16 INFO Slf4jLogger: Slf4jLogger started
15/09/01 11:32:16 INFO Remoting: Starting remoting
15/09/01 11:32:16 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkMaster@ubuntu:7077]
15/09/01 11:32:16 INFO Utils: Successfully started service 'sparkMaster' on 
port 7077.
15/09/01 11:32:17 INFO Utils: Successfully started service on port 6066.
15/09/01 11:32:17 INFO StandaloneRestServer: Started REST server for submitting 
applications on port 6066
15/09/01 11:32:17 INFO Master: Starting Spark master at spark://ubuntu:7077
15/09/01 11:32:17 INFO Master: Running Spark version 1.4.1
15/09/01 11:32:17 INFO Utils: Successfully started service 'MasterUI' on port 
8080.
15/09/01 11:32:17 INFO MasterWebUI: Started MasterWebUI at 
http://10.8.108.92:8080
15/09/01 11:32:17 INFO Master: I have been elected leader! New state: ALIVE
15/09/01 11:32:57 INFO Master: Registering app 
ML-SPARK-APPLICATION-0.9453989189195968
15/09/01 11:32:57 INFO Master: Registered app 
ML-SPARK-APPLICATION-0.9453989189195968 with ID app-20150901113257-

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Schema From parquet file

2015-09-01 Thread Cheng Lian

What exactly do you mean by "get schema from a parquet file"?

- If you are trying to inspect Parquet files, parquet-tools can be 
pretty neat: https://github.com/Parquet/parquet-mr/issues/321
- If you are trying to get the Parquet schema as a Parquet MessageType, you 
may resort to the readFooter() and readAllFooters() utility methods in 
ParquetFileReader
- If you are trying to get Spark SQL StructType schema out of a Parquet 
file, then the most convenient way is to load it as a DataFrame. 
However, "loading" it as a DataFrame doesn't mean we scan the whole 
file. Instead, we only try to do minimum metadata discovery work like 
schema discovery and schema merging.


Cheng
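
A minimal sketch of the third option above (assuming a SQLContext named sqlContext and an illustrative path):

// Only footers/metadata are read here; no data scan is triggered until an action runs.
val df = sqlContext.read.parquet("/path/to/parquet")
val schema: org.apache.spark.sql.types.StructType = df.schema
schema.printTreeString()   // or schema.json for a serializable representation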

On 9/1/15 7:07 PM, Hafiz Mujadid wrote:

Hi all!

Is there any way to get schema from a parquet file without loading into
dataframe?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Schema-From-parquet-file-tp24535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: extracting file path using dataframes

2015-09-01 Thread Jonathan Coveney
You can make a Hadoop input format which passes through the name of the
file. I generally find it easier to just hit Hadoop, get the file names,
and construct the RDDs though

El martes, 1 de septiembre de 2015, Matt K  escribió:

> Just want to add - I'm looking to partition the resulting Parquet files by
> customer-id, which is why I'm looking to extract the customer-id from the
> path.
>
> On Tue, Sep 1, 2015 at 7:00 PM, Matt K  > wrote:
>
>> Hi all,
>>
>> TL;DR - is there a way to extract the source path from an RDD via the
>> Scala API?
>>
>> I have sequence files on S3 that look something like this:
>> s3://data/customer=123/...
>> s3://data/customer=456/...
>>
>> I am using Spark Dataframes to convert these sequence files to Parquet.
>> As part of the processing, I actually need to know the customer-id. I'm
>> doing something like this:
>>
>> val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*", 
>> classOf[BytesWritable],
>> classOf[Text])
>>
>> val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema,
>> delimiter))
>>
>> val dataFrame = sql.createDataFrame(rowRdd, schema)
>>
>>
>> What I am trying to figure out is how to get the customer-id, which is
>> part of the path. I am not sure if there's a way to extract the source path
>> from the resulting HadoopRDD. Do I need to create one RDD per customer to
>> get around this?
>>
>>
>> Thanks,
>>
>> -Matt
>>
>
>
>
> --
> www.calcmachine.com - easy online calculator.
>
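
A sketch of the second approach Jonathan mentions (list the paths, build one RDD per customer, and tag records with the id parsed from the path). The bucket layout and key/value types follow the original message; everything else is illustrative:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, Text}
import java.net.URI

// List the customer=... prefixes under the bucket using the Hadoop FileSystem API.
val fs = FileSystem.get(new URI("s3n://data/"), sc.hadoopConfiguration)
val customerDirs = fs.listStatus(new Path("s3n://data/")).map(_.getPath)

// One sequence-file RDD per customer, each record tagged with its customer id,
// then unioned back into a single RDD for the DataFrame conversion.
val tagged = customerDirs.map { dir =>
  val customerId = dir.getName.stripPrefix("customer=")
  sc.sequenceFile(dir.toString, classOf[BytesWritable], classOf[Text])
    .map { case (_, text) => (customerId, text.toString) }
}.reduce(_ union _)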


Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
Should I use DirectOutputCommitter?
spark.hadoop.mapred.output.committer.class
 com.appsflyer.spark.DirectOutputCommitter



On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov 
wrote:

> I run spark 1.4.1 in amazom aws emr 4.0.0
>
> For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
> comparison to emr 3.8  (was 5 sec, now 95 sec)
>
> Actually saveAsTextFile says that it's done in 4.356 sec but after that I
> see lots of INFO messages with 404 error from com.amazonaws.latency logger
> for next 90 sec
>
> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>
> 2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
> (saveAsTextFile at :22) finished in 4.356 s
> 2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler
> (Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all
> completed, from pool
> 2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
> :22, took 4.547829 s
> 2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
> (S3NativeFileSystem.java:listStatus(896)) - listStatus
> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
> 2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
> ID: 3B2F06FD11682D22), S3 Extended Request ID:
> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], Exception=1,
> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
> 2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
> ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
> RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
> HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
> RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
> HttpClientSendRequestTime=[0.089],
> 2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
> ID: 62C6B413965447FD), S3 Extended Request ID:
> 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
> AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], Exception=1,
> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
> HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
> RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
> 2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
> ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
> RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
> HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
> RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
> HttpClientSendRequestTime=[0.068],
> 2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
> ID: 4846575A1C373BB9), S3 Extended Request ID:
> aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E],
> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
> AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], Exception=1,
> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.531],
> HttpRequestTime=[11.134], HttpClientReceiveResponseTime=[9.434],
> RequestSigningTime=[0.206], HttpClientSendRequestTime=[0.13],
> 2015-09-01 21:16:17,786 INFO  [main] s3n.S3NativeFileSystem
> 

Re: Group by specific key and save as parquet

2015-09-01 Thread Cheng Lian

Starting from Spark 1.4, you can do this via dynamic partitioning:

sqlContext.table("trade").write.partitionBy("date").parquet("/tmp/path")

Cheng
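
Spelling that out slightly (a sketch; the output path is illustrative): partitionBy writes one date=... sub-directory per distinct value, which replaces the manual groupBy / per-group save loop in the quoted code below.

val trades = sqlContext.table("trade")
trades.write
  .partitionBy("date")              // one sub-directory per distinct date value
  .mode("overwrite")                // roughly mirrors the delete-if-exists handling below
  .parquet("/tmp/trades_by_date")   // e.g. /tmp/trades_by_date/date=2015-08-31/part-*.parquet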

On 9/1/15 8:27 AM, gtinside wrote:

Hi ,

I have a set of data, I need to group by specific key and then save as
parquet. Refer to the code snippet below. I am querying trade and then
grouping by date

val df = sqlContext.sql("SELECT * FROM trade")
val dfSchema = df.schema
val partitionKeyIndex = dfSchema.fieldNames.seq.indexOf("date")
//group by date
val groupedByPartitionKey = df.rdd.groupBy { row =>
  row.getString(partitionKeyIndex)
}
val failure = groupedByPartitionKey.map(row => {
  val rowDF = sqlContext.createDataFrame(sc.parallelize(row._2.toSeq), dfSchema)
  val fileName = config.getTempFileName(row._1)
  try {
    val dest = new Path(fileName)
    if (DefaultFileSystem.getFS.exists(dest)) {
      DefaultFileSystem.getFS.delete(dest, true)
    }
    rowDF.saveAsParquetFile(fileName)
  } catch {
    case e: Throwable => {
      logError("Failed to save parquet file")
      failure = true
    }
  }
})

This code doesn't work well because of NestedRDD , what is the best way to
solve this problem?

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Group-by-specific-key-and-save-as-parquet-tp24527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: cached data between jobs

2015-09-01 Thread Jeff Zhang
Hi Eric,

If the 2 jobs share the same parent stages, these stages can be skipped for
the second job.

Here's one simple example:

val rdd1 = sc.parallelize(1 to 10).map(e=>(e,e))
val rdd2 = rdd1.groupByKey()
rdd2.map(e=>e._1).collect() foreach println
rdd2.map(e=> (e._1, e._2.size)).collect foreach println

Obviously, there are 2 jobs and both of them have 2 stages. Luckily, here
these 2 jobs share the same stage (the first stage of each job). Although
you don't cache this data explicitly, once a stage is completed it is
marked as available and can be used by other jobs, so the second job
only needs to run one stage.
You should be able to see the skipped stage in the Spark job UI.



[image: Inline image 1]

On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker  wrote:

> Hi,
>
> I'm noticing that a 30 minute job that was initially IO-bound may not be
> during subsequent runs.  Is there some kind of between-job caching that
> happens in Spark or in Linux that outlives jobs and that might be making
> subsequent runs faster?  If so, is there a way to avoid the caching in
> order to get a better sense of the worst-case scenario?
>
> (It's also possible that I've simply changed something that made things
> faster.)
>
> Eric
>
>


-- 
Best Regards

Jeff Zhang


Re: Conditionally do things different on the first minibatch vs subsequent minibatches in a dstream

2015-09-01 Thread Ted Yu
Can you utilize the following method in StreamingListener ?

  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted) {

Cheers

On Tue, Sep 1, 2015 at 12:36 PM, steve_ash  wrote:

> We have some logic that we need to apply while we are processing the events
> in the first minibatch only.  For the second, third, etc. minibatches we
> don't need to do this special logic.  I can't just do it as a one time
> thing
> - I need to modify a field on the events in the first minibatch.
>
> One approach:
> 1- create inp = inputDStream
> 2- call inp.foreachRdd(rdd -> rdd.foreachPartition( isFirst = true ) )
> 3- do m1 = inp.map( if (isFirst) setStuffOnEvent )
> 4- do m2 = m1.mapPartition( isFirst = false )
> 5- do m2.foreachRdd ( rest of my stuff)
>
> since foreachRdd in (2) is an action that just happens once right at the
> beginning? Or does this happen for every minibatch just like the
> foreachRdd() in (5) does?
>
> What is another way that we accomplish this need of setting something on
> the
> events in the first minibatch and not others?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Conditionally-do-things-different-on-the-first-minibatch-vs-subsequent-minibatches-in-a-dstream-tp24547.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
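
Ted's listener hook is one way to observe batch boundaries; an even simpler variant keeps a driver-side flag and flips it inside foreachRDD itself. A sketch (setStuffOnEvent and inputDStream are the poster's names, and the default spark.streaming.concurrentJobs=1 is assumed so batch jobs run one at a time):

import java.util.concurrent.atomic.AtomicBoolean

// foreachRDD's body runs on the driver once per batch; with the default
// spark.streaming.concurrentJobs=1 those bodies execute sequentially, so only
// the very first batch observes the flag as true.
val firstBatch = new AtomicBoolean(true)

inputDStream.foreachRDD { rdd =>
  val events =
    if (firstBatch.getAndSet(false)) rdd.map(setStuffOnEvent)   // special-case batch #1
    else rdd
  // ... rest of the per-batch processing, using `events`
}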


Re: extracting file path using dataframes

2015-09-01 Thread Matt K
Just want to add - I'm looking to partition the resulting Parquet files by
customer-id, which is why I'm looking to extract the customer-id from the
path.

On Tue, Sep 1, 2015 at 7:00 PM, Matt K  wrote:

> Hi all,
>
> TL;DR - is there a way to extract the source path from an RDD via the
> Scala API?
>
> I have sequence files on S3 that look something like this:
> s3://data/customer=123/...
> s3://data/customer=456/...
>
> I am using Spark Dataframes to convert these sequence files to Parquet. As
> part of the processing, I actually need to know the customer-id. I'm doing
> something like this:
>
> val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*", 
> classOf[BytesWritable],
> classOf[Text])
>
> val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter
> ))
>
> val dataFrame = sql.createDataFrame(rowRdd, schema)
>
>
> What I am trying to figure out is how to get the customer-id, which is
> part of the path. I am not sure if there's a way to extract the source path
> from the resulting HadoopRDD. Do I need to create one RDD per customer to
> get around this?
>
>
> Thanks,
>
> -Matt
>



-- 
www.calcmachine.com - easy online calculator.


extracting file path using dataframes

2015-09-01 Thread Matt K
Hi all,

TL;DR - is there a way to extract the source path from an RDD via the Scala
API?

I have sequence files on S3 that look something like this:
s3://data/customer=123/...
s3://data/customer=456/...

I am using Spark Dataframes to convert these sequence files to Parquet. As
part of the processing, I actually need to know the customer-id. I'm doing
something like this:

val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*",
classOf[BytesWritable],
classOf[Text])

val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter))

val dataFrame = sql.createDataFrame(rowRdd, schema)


What I am trying to figure out is how to get the customer-id, which is part
of the path. I am not sure if there's a way to extract the source path from
the resulting HadoopRDD. Do I need to create one RDD per customer to get
around this?


Thanks,

-Matt


Re: Hung spark executors don't count toward worker memory limit

2015-09-01 Thread hai
Hi Keith, we are running into the same issue here with Spark standalone
1.2.1. I was wondering if you have found a solution or workaround.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hung-spark-executors-don-t-count-toward-worker-memory-limit-tp16083p24548.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I run spark 1.4.1 in amazom aws emr 4.0.0

For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
comparison to emr 3.8  (was 5 sec, now 95 sec)

Actually saveAsTextFile says that it's done in 4.356 sec but after that I
see lots of INFO messages with 404 error from com.amazonaws.latency logger
for next 90 sec

spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
"A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
(saveAsTextFile at :22) finished in 4.356 s
2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler
(Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all
completed, from pool
2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
:22, took 4.547829 s
2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
(S3NativeFileSystem.java:listStatus(896)) - listStatus
s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 3B2F06FD11682D22), S3 Extended Request ID:
C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
HttpClientSendRequestTime=[0.089],
2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 62C6B413965447FD), S3 Extended Request ID:
4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
HttpClientSendRequestTime=[0.068],
2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 4846575A1C373BB9), S3 Extended Request ID:
aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.531],
HttpRequestTime=[11.134], HttpClientReceiveResponseTime=[9.434],
RequestSigningTime=[0.206], HttpClientSendRequestTime=[0.13],
2015-09-01 21:16:17,786 INFO  [main] s3n.S3NativeFileSystem
(S3NativeFileSystem.java:listStatus(896)) - listStatus
s3n://foo-bar/tmp/test40_20/_temporary/0/task_201509012116_0005_m_00
with recursive false
2015-09-01 21:16:17,798 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error 

Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Sean Owen
(pedantic: it's the log-probabilities)

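For completeness, a small sketch of turning those log-posteriors into normalized per-class probabilities (it assumes you have copied or forked the 1.1.0 model class so that brzPi and brzTheta are accessible, which they are not by default):

import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM}

def classProbabilities(brzPi: BDV[Double], brzTheta: BDM[Double], x: BDV[Double]): Array[Double] = {
  val logPosterior = brzPi + brzTheta * x                    // un-normalized log-probabilities, one per label index
  val maxLog = breeze.linalg.max(logPosterior)
  val expd = logPosterior.map(lp => math.exp(lp - maxLog))   // subtract the max for numerical stability
  val total = breeze.linalg.sum(expd)
  expd.map(_ / total).toArray                                // normalized probabilities in label-index order
}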
On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang  wrote:
> Actually
> brzPi + brzTheta * testData.toBreeze
> is the probabilities of the input Vector on each class, however it's a
> Breeze Vector.
> Pay attention the index of this Vector need to map to the corresponding
> label index.
>
> 2015-08-28 20:38 GMT+08:00 Adamantios Corais :
>>
>> Hi,
>>
>> I am trying to change the following code so as to get the probabilities of
>> the input Vector on each class (instead of the class itself with the highest
>> probability). I know that this is already available as part of the most
>> recent release of Spark but I have to use Spark 1.1.0.
>>
>> Any help is appreciated.
>>
>>> override def predict(testData: Vector): Double = {
>>> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
>>>   }
>>
>>
>>>
>>> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>>
>>
>> // Adamantios
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark shell and StackOverFlowError

2015-09-01 Thread ponkin
Hi,
I cannot reproduce your error on Spark 1.2.1, and there is not enough information.
What command line arguments are you using when starting spark-shell? What data
are you reading? etc.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-and-StackOverFlowError-tp24508p24531.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
Hi,

You just need to extend Partitioner and override the numPartitions and
getPartition methods, see below

import org.apache.spark.Partitioner

class MyPartitioner(parts: Int) extends Partitioner {
  // Return the number of partitions
  override def numPartitions: Int = parts
  // Return a partition id between 0 and numPartitions - 1 for the given key
  override def getPartition(key: Any): Int = math.abs(key.hashCode % parts)
}

On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
wrote:

> Hi Sparkians
>
> How can we create a customer partition in pyspark
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
Ah sorry I miss read your question. In pyspark it looks like you just need
to instantiate the Partitioner class with numPartitions and partitionFunc.

On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf  wrote:

> Hi
>
> I did not get this, e.g if i need to create a custom partitioner like
> range partitioner.
>
> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:
>
>> Hi,
>>
>> You just need to extend Partitioner and override the numPartitions and
>> getPartition methods, see below
>>
>> class MyPartitioner extends partitioner {
>>   def numPartitions: Int = // Return the number of partitions
>>   def getPartition(key Any): Int = // Return the partition for a given key
>> }
>>
>> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
>> wrote:
>>
>>> Hi Sparkians
>>>
>>> How can we create a customer partition in pyspark
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
>
> --
> with Regards
> Shahid Ashraf
>


HiveThriftServer not registering with Zookeeper

2015-09-01 Thread sreeramvenkat
Hi,

 I am trying to setup dynamic service discovery for HiveThriftServer in a
two node cluster.

In the thrift server logs, I am not seeing itself registering with zookeeper
- no znode is getting created.

Pasting relevant section from my $SPARK_HOME/conf/hive-site.xml

<property>
  <name>hive.zookeeper.quorum</name>
  <value>host1:port1,host2:port2</value>
</property>

<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>

<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hivethriftserver2</value>
</property>

<property>
  <name>hive.zookeeper.client.port</name>
  <value>2181</value>
</property>

 Any help is appreciated.

PS: Zookeeper is working fine and zknodes are getting created with
hiveserver2. This issue happens only with hivethriftserver.

Regards,
Sreeram



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-not-registering-with-Zookeeper-tp24534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark 1.5 sort slow

2015-09-01 Thread patcharee

Hi,

I found spark 1.5 sorting is very slow compared to spark 1.4. Below is 
my code snippet


val sqlRDD = sql("select date, u, v, z from fino3_hr3 where zone == 
2 and z >= 2 and z <= order by date, z")

println("sqlRDD " + sqlRDD.count())

The fino3_hr3 (in the sql command) is a hive table in orc format, 
partitioned by zone and z.


Spark 1.5 takes 4.5 minutes to execute this SQL, while Spark 1.4 takes 1.5
minutes. I noticed that, unlike Spark 1.4, when Spark 1.5 sorts, the data is
shuffled into only a few tasks rather than being divided across all tasks. Do I
need to set any configuration explicitly? Any suggestions?
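
For what it's worth, the first thing worth checking is the number of shuffle partitions the SQL sort is using (a sketch; 200 is the default in both versions and the 400 below is only an illustration, not a recommendation):

println(sqlContext.getConf("spark.sql.shuffle.partitions", "200"))   // 200 unless overridden
sqlContext.setConf("spark.sql.shuffle.partitions", "400")            // bump it before running the sort again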


BR,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
Did you start spark cluster using command sbin/start-all.sh ?
You should have 2 log files under folder if it is single-node cluster. Like
the following

spark-jzhang-org.apache.spark.deploy.master.Master-1-jzhangMBPr.local.out
spark-jzhang-org.apache.spark.deploy.worker.Worker-1-jzhangMBPr.local.out



On Tue, Sep 1, 2015 at 4:01 PM, Madawa Soysa 
wrote:

> There are no logs which includes apache.spark.deploy.worker in file name
> in the SPARK_HOME/logs folder.
>
> On 1 September 2015 at 13:00, Jeff Zhang  wrote:
>
>> This is master log. There's no worker registration info in the log. That
>> means the worker may not start properly. Please check the log file
>> with apache.spark.deploy.worker in file name.
>>
>>
>>
>> On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa 
>> wrote:
>>
>>> I cannot see anything abnormal in logs. What would be the reason for not
>>> availability of executors?
>>>
>>> On 1 September 2015 at 12:24, Madawa Soysa 
>>> wrote:
>>>
 Following are the logs available. Please find the attached.

 On 1 September 2015 at 12:18, Jeff Zhang  wrote:

> It's in SPARK_HOME/logs
>
> Or you can check the spark web ui. http://[master-machine]:8080
>
>
> On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa 
> wrote:
>
>> How do I check worker logs? SPARK_HOME/work folder does not exist. I
>> am using the spark standalone mode.
>>
>> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>>
>>> No executors ? Please check the worker logs if you are using spark
>>> standalone mode.
>>>
>>> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa <
>>> madawa...@cse.mrt.ac.lk> wrote:
>>>
 Hi All,

 I have successfully submitted some jobs to spark master. But the
 jobs won't progress and not finishing. Please see the attached 
 screenshot.
 These are fairly very small jobs and this shouldn't take more than a 
 minute
 to finish.

 I'm new to spark and any help would be appreciated.

 Thanks,
 Madawa.



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of Moratuwa.
>>
>>
>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>> madawa...@cse.mrt.ac.lk
>> LinkedIn  | Twitter
>>  | Tumblr 
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



 --

 *_**Madawa Soysa*

 Undergraduate,

 Department of Computer Science and Engineering,

 University of Moratuwa.


 Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
 madawa...@cse.mrt.ac.lk
 LinkedIn  | Twitter
  | Tumblr 

>>>
>>>
>>>
>>> --
>>>
>>> *_**Madawa Soysa*
>>>
>>> Undergraduate,
>>>
>>> Department of Computer Science and Engineering,
>>>
>>> University of Moratuwa.
>>>
>>>
>>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>>> madawa...@cse.mrt.ac.lk
>>> LinkedIn  | Twitter
>>>  | Tumblr 
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



-- 
Best Regards

Jeff Zhang


Re: bulk upload to Elasticsearch and shuffle behavior

2015-09-01 Thread Igor Berman
Hi Eric,
I see that you solved your problem. IMHO, when you do a repartition you split
your work into two stages, so the HBase lookup happens in the first stage and
the upload to ES happens after the shuffle in the next stage; without the
repartition it's hard to tell how much time goes to the ES upload and how much
to the HBase lookup.

If you don't mind, it would be interesting to know whether you reduce the number
of partitions before uploading to ES. Do you have a rule of thumb for how many
partitions there should be before the upload?
We have a similar pipeline and we reduce the number of partitions to 8 or so
before uploading to ES (it probably depends on the strength of the ES cluster).

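For reference, a sketch of what that looks like in code (it assumes the elasticsearch-hadoop connector is on the classpath and es.nodes is configured; jsonDocs, the index name and the 8 are placeholders):

import org.elasticsearch.spark._

val toIndex = jsonDocs          // hypothetical RDD of Map/case-class documents built after the HBase lookups
  .coalesce(8)                  // shrink to a few partitions without a full shuffle, unlike repartition
toIndex.saveToEs("events/event")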

On 1 September 2015 at 06:05, Eric Walker  wrote:

> I think I have found out what was causing me difficulties.  It seems I was
> reading too much into the stage description shown in the "Stages" tab of
> the Spark application UI.  While it said "repartition at
> NativeMethodAccessorImpl.java:-2", I can infer from the network traffic and
> from its response to changes I subsequently made that the actual code that
> was running was the code doing the HBase lookups.  I suspect the actual
> shuffle, once it occurred, required on the same order of network IO as the
> upload to Elasticsearch that followed.
>
> Eric
>
>
>
> On Mon, Aug 31, 2015 at 6:09 PM, Eric Walker 
> wrote:
>
>> Hi,
>>
>> I am working on a pipeline that carries out a number of stages, the last
>> of which is to build some large JSON objects from information in the
>> preceding stages.  The JSON objects are then uploaded to Elasticsearch in
>> bulk.
>>
>> If I carry out a shuffle via a `repartition` call after the JSON
>> documents have been created, the upload to ES is fast.  But the shuffle
>> itself takes many tens of minutes and is IO-bound.
>>
>> If I omit the repartition, the upload to ES takes a long time due to a
>> complete lack of parallelism.
>>
>> Currently, the step that precedes the assembling of the JSON documents,
>> which goes into the final repartition call, is the querying of pairs of
>> object ids.  In a mapper the ids are resolved to documents by querying
>> HBase.  The initial pairs of ids are obtained via a query against the SQL
>> context, and the query result is repartitioned before going into the mapper
>> that resolves the ids into documents.
>>
>> It's not clear to me why the final repartition preceding the upload to ES
>> is required.  I would like to omit it, since it is so expensive and
>> involves so much network IO, but have not found a way to do this yet.  If I
>> omit the repartition, the job takes much longer.
>>
>> Does anyone know what might be going on here, and what I might be able to
>> do to get rid of the last `repartition` call before the upload to ES?
>>
>> Eric
>>
>>
>


How to determine the value for spark.sql.shuffle.partitions?

2015-09-01 Thread Romi Kuntsman
Hi all,

The number of partition greatly affect the speed and efficiency of
calculation, in my case in DataFrames/SparkSQL on Spark 1.4.0.

Too few partitions with large data cause OOM exceptions.
Too many partitions on small data cause a delay due to overhead.

How do you programmatically determine the optimal number of partitions and
cores in Spark, as a function of:

   1. available memory per core
   2. number of records in input data
   3. average/maximum record size
   4. cache configuration
   5. shuffle configuration
   6. serialization
   7. etc?

Any general best practices?
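
For concreteness, one rough way to turn a couple of those inputs into a number (a sketch only; the 128 MB target and the record-count/size estimates are assumptions, not a general recommendation):

val targetBytesPerPartition = 128L * 1024 * 1024              // assumed target of ~128 MB per shuffle partition
val estimatedShuffleBytes = numRecords * avgRecordBytes       // hypothetical estimates from your own input
val shufflePartitions = math.max(
  sqlContext.sparkContext.defaultParallelism,
  (estimatedShuffleBytes / targetBytesPerPartition).toInt + 1)
sqlContext.setConf("spark.sql.shuffle.partitions", shufflePartitions.toString)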

Thanks!

Romi K.


Re: How to compute the probability of each class in Naive Bayes

2015-09-01 Thread Yanbo Liang
Actually,
brzPi + brzTheta * testData.toBreeze
gives the probabilities of the input Vector for each class; however, it is a
Breeze Vector.
Note that the indices of this Vector need to be mapped to the corresponding
label indices.

2015-08-28 20:38 GMT+08:00 Adamantios Corais :

> Hi,
>
> I am trying to change the following code so as to get the probabilities of
> the input Vector on each class (instead of the class itself with the
> highest probability). I know that this is already available as part of the
> most recent release of Spark but I have to use Spark 1.1.0.
>
> Any help is appreciated.
>
> override def predict(testData: Vector): Double = {
>> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
>>   }
>
>
>
>> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>
>
>
> *// Adamantios*
>
>
>


Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
I used ./sbin/start-master.sh

When I used ./sbin/start-all.sh the start fails. I get the following error.

failed to launch org.apache.spark.deploy.master.Master:
localhost: ssh: connect to host localhost port 22: Connection refused

On 1 September 2015 at 13:41, Jeff Zhang  wrote:

> Did you start spark cluster using command sbin/start-all.sh ?
> You should have 2 log files under folder if it is single-node cluster.
> Like the following
>
> spark-jzhang-org.apache.spark.deploy.master.Master-1-jzhangMBPr.local.out
> spark-jzhang-org.apache.spark.deploy.worker.Worker-1-jzhangMBPr.local.out
>
>
>
> On Tue, Sep 1, 2015 at 4:01 PM, Madawa Soysa 
> wrote:
>
>> There are no logs which includes apache.spark.deploy.worker in file name
>> in the SPARK_HOME/logs folder.
>>
>> On 1 September 2015 at 13:00, Jeff Zhang  wrote:
>>
>>> This is master log. There's no worker registration info in the log. That
>>> means the worker may not start properly. Please check the log file
>>> with apache.spark.deploy.worker in file name.
>>>
>>>
>>>
>>> On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa 
>>> wrote:
>>>
 I cannot see anything abnormal in logs. What would be the reason for
 not availability of executors?

 On 1 September 2015 at 12:24, Madawa Soysa 
 wrote:

> Following are the logs available. Please find the attached.
>
> On 1 September 2015 at 12:18, Jeff Zhang  wrote:
>
>> It's in SPARK_HOME/logs
>>
>> Or you can check the spark web ui. http://[master-machine]:8080
>>
>>
>> On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa > > wrote:
>>
>>> How do I check worker logs? SPARK_HOME/work folder does not exist. I
>>> am using the spark standalone mode.
>>>
>>> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>>>
 No executors ? Please check the worker logs if you are using spark
 standalone mode.

 On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa <
 madawa...@cse.mrt.ac.lk> wrote:

> Hi All,
>
> I have successfully submitted some jobs to spark master. But the
> jobs won't progress and not finishing. Please see the attached 
> screenshot.
> These are fairly very small jobs and this shouldn't take more than a 
> minute
> to finish.
>
> I'm new to spark and any help would be appreciated.
>
> Thanks,
> Madawa.
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>
>>> *_**Madawa Soysa*
>>>
>>> Undergraduate,
>>>
>>> Department of Computer Science and Engineering,
>>>
>>> University of Moratuwa.
>>>
>>>
>>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>>> madawa...@cse.mrt.ac.lk
>>> LinkedIn  | Twitter
>>>  | Tumblr
>>> 
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



 --

 *_**Madawa Soysa*

 Undergraduate,

 Department of Computer Science and Engineering,

 University of Moratuwa.


 Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
 madawa...@cse.mrt.ac.lk
 LinkedIn  | Twitter
  | Tumblr 

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of Moratuwa.
>>
>>
>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>> madawa...@cse.mrt.ac.lk
>> LinkedIn  | Twitter
>>  | Tumblr 
>>
>
>
>
> --
> Best Regards
>
> 

Custom Partitioner

2015-09-01 Thread shahid qadri
Hi Sparkians

How can we create a customer partition in pyspark

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to effieciently write sorted neighborhood in pyspark

2015-09-01 Thread shahid qadri

> On Aug 25, 2015, at 10:43 PM, shahid qadri  wrote:
> 
> Any resources on this
> 
>> On Aug 25, 2015, at 3:15 PM, shahid qadri  wrote:
>> 
>> I would like to implement sorted neighborhood approach in spark, what is the 
>> best way to write that in pyspark.
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
Hi

I did not get this - e.g., what if I need to create a custom partitioner like a
range partitioner?

On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:

> Hi,
>
> You just need to extend Partitioner and override the numPartitions and
> getPartition methods, see below
>
> class MyPartitioner extends partitioner {
>   def numPartitions: Int = // Return the number of partitions
>   def getPartition(key Any): Int = // Return the partition for a given key
> }
>
> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
> wrote:
>
>> Hi Sparkians
>>
>> How can we create a customer partition in pyspark
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


-- 
with Regards
Shahid Ashraf


Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
Hi

I think a range partitioner is not available in pyspark, so if we want to create
one, how should we create it? That is my question.

On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker  wrote:

> Ah sorry I miss read your question. In pyspark it looks like you just need
> to instantiate the Partitioner class with numPartitions and partitionFunc.
>
> On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf  wrote:
>
>> Hi
>>
>> I did not get this, e.g if i need to create a custom partitioner like
>> range partitioner.
>>
>> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:
>>
>>> Hi,
>>>
>>> You just need to extend Partitioner and override the numPartitions and
>>> getPartition methods, see below
>>>
>>> class MyPartitioner extends partitioner {
>>>   def numPartitions: Int = // Return the number of partitions
>>>   def getPartition(key Any): Int = // Return the partition for a given
>>> key
>>> }
>>>
>>> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
>>> wrote:
>>>
 Hi Sparkians

 How can we create a customer partition in pyspark

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>
>>
>> --
>> with Regards
>> Shahid Ashraf
>>
>


-- 
with Regards
Shahid Ashraf


Re: Is it possible to create spark cluster in different network?

2015-09-01 Thread Akhil Das
Did you try with SPARK_LOCAL_IP?

Thanks
Best Regards

On Tue, Sep 1, 2015 at 12:29 AM, sakana  wrote:

> Hi
>
> I am successful create Spark cluster in openStack.
> I want to create spark cluster in different openStack sites.
>
> In openstack, if you create instance, it only know it's private ip ( like
> 10.x.y.z ), it will not know it have public IP for itself. ( I try to
> export
> SPARK_MASTER_IP=xxx.xxx.xxx.xxx with public ip, but log show it can't bind
> it )
>
>
> so, the port 7070 is listen on private IP address , and refuse connect for
> other network ip
>
> $ netstat -tupln
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address   Foreign Address State
> PID/Program name
> tcp0  0 0.0.0.0:21  0.0.0.0:*   LISTEN
> -
> tcp0  0 0.0.0.0:50070   0.0.0.0:*   LISTEN
> 12921/java
> tcp0  0 0.0.0.0:22  0.0.0.0:*   LISTEN
> -
> tcp0  0 10.103.0.67:90000.0.0.0:*   LISTEN
> 12921/java
> tcp0  0 0.0.0.0:50090   0.0.0.0:*   LISTEN
> 13163/java
> tcp6   0  0 :::22   :::*LISTEN
> -
> tcp6   0  0 10.103.0.67:7077:::*LISTEN
> 25488/java
> tcp6   0  0 :::8080 :::*LISTEN
> 25488/java
> tcp6   0  0 127.255.255.1:6066  :::*LISTEN
> 25488/java
> udp0  0 0.0.0.0:64062   0.0.0.0:*
> -
> udp0  0 0.0.0.0:68  0.0.0.0:*
> -
> udp6   0  0 :::59067:::*
>
> Is there any possible listen 0.0.0.0:7077 with spark?
>
> I want the other slaves to connect master with public IP in openstack.
>
>
>
> Thanks for kindly help
>
>
>
> Max
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-create-spark-cluster-in-different-network-tp24524.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
something like...

class RangePartitioner(object):
    def __init__(self, numParts, boundaries):
        # boundaries: sorted upper bounds, one per partition except the last
        self.numPartitions = numParts
        self.boundaries = boundaries

    def partitionFunc(self, key):
        # Turn the key into a partition id: the first range whose bound exceeds the key
        for i, bound in enumerate(self.boundaries):
            if key < bound:
                return i
        return self.numPartitions - 1

# usage: pair_rdd.partitionBy(p.numPartitions, p.partitionFunc)

On Tue, Sep 1, 2015 at 11:38 AM shahid ashraf  wrote:

> Hi
>
> I think range partitioner is not available in pyspark, so if we want
> create one. how should we create that. my question is that.
>
> On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker  wrote:
>
>> Ah sorry I miss read your question. In pyspark it looks like you just
>> need to instantiate the Partitioner class with numPartitions and
>> partitionFunc.
>>
>> On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf  wrote:
>>
>>> Hi
>>>
>>> I did not get this, e.g if i need to create a custom partitioner like
>>> range partitioner.
>>>
>>> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:
>>>
 Hi,

 You just need to extend Partitioner and override the numPartitions and
 getPartition methods, see below

 class MyPartitioner extends partitioner {
   def numPartitions: Int = // Return the number of partitions
   def getPartition(key Any): Int = // Return the partition for a given
 key
 }

 On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
 wrote:

> Hi Sparkians
>
> How can we create a customer partition in pyspark
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>>>
>>>
>>> --
>>> with Regards
>>> Shahid Ashraf
>>>
>>
>
>
> --
> with Regards
> Shahid Ashraf
>


Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
There are no logs which includes apache.spark.deploy.worker in file name in
the SPARK_HOME/logs folder.

On 1 September 2015 at 13:00, Jeff Zhang  wrote:

> This is master log. There's no worker registration info in the log. That
> means the worker may not start properly. Please check the log file
> with apache.spark.deploy.worker in file name.
>
>
>
> On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa 
> wrote:
>
>> I cannot see anything abnormal in logs. What would be the reason for not
>> availability of executors?
>>
>> On 1 September 2015 at 12:24, Madawa Soysa 
>> wrote:
>>
>>> Following are the logs available. Please find the attached.
>>>
>>> On 1 September 2015 at 12:18, Jeff Zhang  wrote:
>>>
 It's in SPARK_HOME/logs

 Or you can check the spark web ui. http://[master-machine]:8080


 On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa 
 wrote:

> How do I check worker logs? SPARK_HOME/work folder does not exist. I
> am using the spark standalone mode.
>
> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>
>> No executors ? Please check the worker logs if you are using spark
>> standalone mode.
>>
>> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa > > wrote:
>>
>>> Hi All,
>>>
>>> I have successfully submitted some jobs to spark master. But the
>>> jobs won't progress and not finishing. Please see the attached 
>>> screenshot.
>>> These are fairly very small jobs and this shouldn't take more than a 
>>> minute
>>> to finish.
>>>
>>> I'm new to spark and any help would be appreciated.
>>>
>>> Thanks,
>>> Madawa.
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>
>>> *_**Madawa Soysa*
>>>
>>> Undergraduate,
>>>
>>> Department of Computer Science and Engineering,
>>>
>>> University of Moratuwa.
>>>
>>>
>>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>>> madawa...@cse.mrt.ac.lk
>>> LinkedIn  | Twitter
>>>  | Tumblr 
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of Moratuwa.
>>
>>
>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>> madawa...@cse.mrt.ac.lk
>> LinkedIn  | Twitter
>>  | Tumblr 
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 

*_**Madawa Soysa*

Undergraduate,

Department of Computer Science and Engineering,

University of Moratuwa.


Mobile: +94 71 461 6050 | Email:
madawa...@cse.mrt.ac.lk
LinkedIn  | Twitter
 | Tumblr 


Re: Spark executor OOM issue on YARN

2015-09-01 Thread ponkin
Hi,
Can you please post your stack trace with exceptions? and also command line
attributes in spark-submit?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-OOM-issue-on-YARN-tp24522p24530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: Submitted applications does not run.

2015-09-01 Thread Jeff Zhang
You need to make yourself able to ssh to localhost without password, please
check this blog.

http://hortonworks.com/kb/generating-ssh-keys-for-passwordless-login/



On Tue, Sep 1, 2015 at 4:31 PM, Madawa Soysa 
wrote:

> I used ./sbin/start-master.sh
>
> When I used ./sbin/start-all.sh the start fails. I get the following error.
>
> failed to launch org.apache.spark.deploy.master.Master:
> localhost: ssh: connect to host localhost port 22: Connection refused
>
> On 1 September 2015 at 13:41, Jeff Zhang  wrote:
>
>> Did you start spark cluster using command sbin/start-all.sh ?
>> You should have 2 log files under folder if it is single-node cluster.
>> Like the following
>>
>> spark-jzhang-org.apache.spark.deploy.master.Master-1-jzhangMBPr.local.out
>> spark-jzhang-org.apache.spark.deploy.worker.Worker-1-jzhangMBPr.local.out
>>
>>
>>
>> On Tue, Sep 1, 2015 at 4:01 PM, Madawa Soysa 
>> wrote:
>>
>>> There are no logs which includes apache.spark.deploy.worker in file
>>> name in the SPARK_HOME/logs folder.
>>>
>>> On 1 September 2015 at 13:00, Jeff Zhang  wrote:
>>>
 This is master log. There's no worker registration info in the log.
 That means the worker may not start properly. Please check the log file
 with apache.spark.deploy.worker in file name.



 On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa 
 wrote:

> I cannot see anything abnormal in logs. What would be the reason for
> not availability of executors?
>
> On 1 September 2015 at 12:24, Madawa Soysa 
> wrote:
>
>> Following are the logs available. Please find the attached.
>>
>> On 1 September 2015 at 12:18, Jeff Zhang  wrote:
>>
>>> It's in SPARK_HOME/logs
>>>
>>> Or you can check the spark web ui. http://[master-machine]:8080
>>>
>>>
>>> On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa <
>>> madawa...@cse.mrt.ac.lk> wrote:
>>>
 How do I check worker logs? SPARK_HOME/work folder does not exist.
 I am using the spark standalone mode.

 On 1 September 2015 at 12:05, Jeff Zhang  wrote:

> No executors ? Please check the worker logs if you are using spark
> standalone mode.
>
> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa <
> madawa...@cse.mrt.ac.lk> wrote:
>
>> Hi All,
>>
>> I have successfully submitted some jobs to spark master. But the
>> jobs won't progress and not finishing. Please see the attached 
>> screenshot.
>> These are fairly very small jobs and this shouldn't take more than a 
>> minute
>> to finish.
>>
>> I'm new to spark and any help would be appreciated.
>>
>> Thanks,
>> Madawa.
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



 --

 *_**Madawa Soysa*

 Undergraduate,

 Department of Computer Science and Engineering,

 University of Moratuwa.


 Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
 madawa...@cse.mrt.ac.lk
 LinkedIn  | Twitter
  | Tumblr
 

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of Moratuwa.
>>
>>
>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>> madawa...@cse.mrt.ac.lk
>> LinkedIn  | Twitter
>>  | Tumblr 
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr 
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>

How mature is spark sql

2015-09-01 Thread rakesh sharma
Is it mature enough to use extensively? I see that it is easier than
writing map/reduce in Java. We are being asked to do it in Java itself and
cannot move to Python or Scala.

thanks
rakesh

Schema From parquet file

2015-09-01 Thread Hafiz Mujadid
Hi all!

Is there any way to get schema from a parquet file without loading into
dataframe?

Thanks
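
One option that avoids a DataFrame entirely is to read just the Parquet footer with parquet-mr, roughly like this (a sketch; it assumes parquet-hadoop is on the classpath, and on older builds the package is parquet.hadoop rather than org.apache.parquet.hadoop; the path is illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val footer = ParquetFileReader.readFooter(new Configuration(), new Path("hdfs:///data/part-00000.parquet"))
val parquetSchema = footer.getFileMetaData.getSchema   // Parquet MessageType; print it or walk its fields
println(parquetSchema)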



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Schema-From-parquet-file-tp24535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



reading multiple parquet file using spark sql

2015-09-01 Thread Hafiz Mujadid
Hi 

I want to read multiple parquet files using the Spark SQL load method, just like
we can pass multiple comma-separated paths to the sc.textFile method. Is there
any way to do the same?


Thanks
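
For what it's worth, in 1.4+ the DataFrameReader accepts several paths directly, so something like this should work (paths are illustrative):

val df = sqlContext.read.parquet("hdfs:///data/2015-08-30", "hdfs:///data/2015-08-31")

// or, starting from a single comma-separated string:
val df2 = sqlContext.read.parquet("hdfs:///data/a,hdfs:///data/b".split(","): _*)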



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reading-multiple-parquet-file-using-spark-sql-tp24537.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Hi

In Spark Streaming 1.3 with Kafka, when does the driver fetch the latest offsets
for a run - at the start of each batch, or at the time the batch gets queued?

Say a few of my batches take longer to complete than the batch interval, so some
batches will go into the queue. Does the driver wait for the queued batches to
actually start before fetching the latest offsets, or does it fetch them before
the batches even start, so that when they do run they work on the old offsets
fetched at the time they were queued?


Spark job killed

2015-09-01 Thread Silvio Bernardinello
Hi,

We are running Spark 1.4.0 on a Mesosphere cluster (~250GB memory with 16
activated hosts).
Spark jobs are submitted in coarse mode.

Suddenly, our jobs get killed without any error.

ip-10-0-2-193.us-west-2.compute.internal, PROCESS_LOCAL, 1514 bytes)
15/09/01 10:48:24 INFO TaskSetManager: Finished task 38047.0 in stage 0.0
(TID 38160) in 2856 ms on ip-10-0-0-203.us-west-2.compute.internal
(38048/44617)
15/09/01 10:48:24 INFO TaskSetManager: Starting task 38056.0 in stage 0.0
(TID 38169, ip-10-0-0-204.us-west-2.compute.internal, PROCESS_LOCAL, 1514
bytes)
15/09/01 10:48:24 INFO TaskSetManager: Starting task 38057.0 in stage 0.0
(TID 38170, ip-10-0-0-204.us-west-2.compute.internal, PROCESS_LOCAL, 1514
bytes)
15/09/01 10:48:25 INFO TaskSetManager: Finished task 38048.0 in stage 0.0
(TID 38161) in 2290 ms on ip-10-0-2-194.us-west-2.compute.internal
(38049/44617)
Killed

Where can we find additional information to this issue?

Thank in advance

Silvio



__


*Silvio Bernardinello * |  Data Engineer


Milan | Rome | New York | Shanghai


  


Beintoo Spa - Corso di Porta Romana, 68 - 20122 Milano - Italy - Office
(+39) 02.97.687.959

This email is reserved exclusively for sending and receiving messages
inherent working activities, and is not intended nor authorized for
personal use. Therefore, any outgoing messages or incoming response
messages will be treated as company messages and will be subject to the
corporate IT policy and may possibly to be read by persons other than by
the subscriber of the box. Confidential information may be contained in
this message. If you are not the address indicated in this message, please
do not copy or deliver this message to anyone. In such case, you should
notify the sender immediately and delete the original message.


Re: Spark job killed

2015-09-01 Thread Akhil Das
If it is not some other user then it's the kernel triggering the kill; the job
might be using way too much memory or swap. Check your resource usage while
the job is running and look at the memory overhead etc.

Thanks
Best Regards

On Tue, Sep 1, 2015 at 5:56 PM, Silvio Bernardinello <
sbernardine...@beintoo.com> wrote:

> Hi,
>
> We are running Spark 1.4.0 on a Mesosphere cluster (~250GB memory with 16
> activated hosts).
> Spark jobs are submitted in coarse mode.
>
> Suddenly, our jobs get killed without any error.
>
> ip-10-0-2-193.us-west-2.compute.internal, PROCESS_LOCAL, 1514 bytes)
> 15/09/01 10:48:24 INFO TaskSetManager: Finished task 38047.0 in stage 0.0
> (TID 38160) in 2856 ms on ip-10-0-0-203.us-west-2.compute.internal
> (38048/44617)
> 15/09/01 10:48:24 INFO TaskSetManager: Starting task 38056.0 in stage 0.0
> (TID 38169, ip-10-0-0-204.us-west-2.compute.internal, PROCESS_LOCAL, 1514
> bytes)
> 15/09/01 10:48:24 INFO TaskSetManager: Starting task 38057.0 in stage 0.0
> (TID 38170, ip-10-0-0-204.us-west-2.compute.internal, PROCESS_LOCAL, 1514
> bytes)
> 15/09/01 10:48:25 INFO TaskSetManager: Finished task 38048.0 in stage 0.0
> (TID 38161) in 2290 ms on ip-10-0-2-194.us-west-2.compute.internal
> (38049/44617)
> Killed
>
> Where can we find additional information to this issue?
>
> Thank in advance
>
> Silvio
>
>
>
> __
>
>
> *Silvio Bernardinello * |  Data Engineer
>
>
> Milan | Rome | New York | Shanghai
>
> 
>   
> 
>
> Beintoo Spa - Corso di Porta Romana, 68 - 20122 Milano - Italy - Office
> (+39) 02.97.687.959
>
> This email is reserved exclusively for sending and receiving messages
> inherent working activities, and is not intended nor authorized for
> personal use. Therefore, any outgoing messages or incoming response
> messages will be treated as company messages and will be subject to the
> corporate IT policy and may possibly to be read by persons other than by
> the subscriber of the box. Confidential information may be contained in
> this message. If you are not the address indicated in this message, please
> do not copy or deliver this message to anyone. In such case, you should
> notify the sender immediately and delete the original message.
>


Re: Potential NPE while exiting spark-shell

2015-09-01 Thread nasokan
bump



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Potential-NPE-while-exiting-spark-shell-tp24523p24539.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error using spark.driver.userClassPathFirst=true

2015-09-01 Thread cgalan
Hi,

When I submit a Spark job in "yarn-cluster" mode with the parameter
"spark.driver.userClassPathFirst", my job fails; but if I don't use this
parameter, my job completes successfully. My environment is a few nodes with
CDH5.4 and Spark 1.3.0.

Spark submit with fail:
spark-submit --class Main --master yarn-cluster --conf
spark.driver.userClassPathFirst=true --conf
spark.executor.userClassPathFirst=true Main.jar

Spark-submit with success:
spark-submit --class Main --master yarn-cluster Main.jar

Error with Snappy:
java.lang.UnsatisfiedLinkError:
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at 
org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79)
at
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:157)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:199)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:199)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:199)
at
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:84)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:761)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:589)

My example error code:
public static void main(String[] args) {

    SparkConf conf = new SparkConf().setAppName("Prueba");
    // conf.setMaster("local"); // comment this line out to launch on the cluster

    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> asd = sc.textFile("text.txt");
    asd.count();

    sc.close();
}

Does anyone have any suggestions? The reason I need
"spark.driver.userClassPathFirst=true" is that I use commons-cli-1.3.1 in
my project, while the Spark classpath ships an older version, so I created
a shaded jar; but without "spark.driver.userClassPathFirst=true" I get
dependency conflicts between my jar and the Spark classpath, where a class
ends up using the older version from the classpath instead of the latest one.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-using-spark-driver-userClassPathFirst-true-tp24536.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I checked the previous EMR config (emr-3.8); its mapred-site.xml has the following setting:

<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>

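If the goal is to have Spark's saveAsTextFile pick up the same committer on emr-4.x, one sketch is to pass it through the spark.hadoop.* prefix (the same key could equally go into spark-defaults.conf or --conf on spark-submit):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.mapred.output.committer.class",
       "org.apache.hadoop.mapred.DirectFileOutputCommitter")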


On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov 
wrote:

> Should I use DirectOutputCommitter?
> spark.hadoop.mapred.output.committer.class
>  com.appsflyer.spark.DirectOutputCommitter
>
>
>
> On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov 
> wrote:
>
>> I run spark 1.4.1 in amazom aws emr 4.0.0
>>
>> For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
>> comparison to emr 3.8  (was 5 sec, now 95 sec)
>>
>> Actually saveAsTextFile says that it's done in 4.356 sec but after that I
>> see lots of INFO messages with 404 error from com.amazonaws.latency logger
>> for next 90 sec
>>
>> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
>> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>>
>> 2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
>> (saveAsTextFile at :22) finished in 4.356 s
>> 2015-09-01 21:16:17,637 INFO  [task-result-getter-2]
>> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0,
>> whose tasks have all completed, from pool
>> 2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
>> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
>> :22, took 4.547829 s
>> 2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
>> (S3NativeFileSystem.java:listStatus(896)) - listStatus
>> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
>> 2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>> ID: 3B2F06FD11682D22), S3 Extended Request ID:
>> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], Exception=1,
>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
>> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
>> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
>> 2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>> ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>> RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
>> HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
>> RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
>> HttpClientSendRequestTime=[0.089],
>> 2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>> ID: 62C6B413965447FD), S3 Extended Request ID:
>> 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>> AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], Exception=1,
>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
>> HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
>> RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
>> 2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>> ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>> RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
>> HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
>> RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
>> HttpClientSendRequestTime=[0.068],
>> 2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>> ID: 4846575A1C373BB9), S3 Extended Request ID:
>> aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E],
>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>> AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], Exception=1,
>> 

Spark + Druid

2015-09-01 Thread Harish Butani
Hi,

I am working on the Spark Druid Package:
https://github.com/SparklineData/spark-druid-olap.
For scenarios where a 'raw event' dataset is being indexed in Druid it
enables you to write your Logical Plans(queries/dataflows) against the 'raw
event' dataset and it rewrites parts of the plan to execute as a Druid
Query. In Spark the configuration of a Druid DataSource is somewhat like
configuring an OLAP index in a traditional DB. Early results show
significant speedup of pushing slice and dice queries to Druid.

It comprises of a Druid DataSource that wraps the 'raw event' dataset and
has knowledge of the Druid Index; and a DruidPlanner which is a set of plan
rewrite strategies to convert Aggregation queries into a Plan having a
DruidRDD.

Here is a detailed design document, which also describes a benchmark of
representative queries on the TPCH dataset.

Looking for folks who would be willing to try this out and/or contribute.

regards,
Harish Butani.


Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Since in my app , after processing the events I am posting the events to
some external server- if external server is down - I want to backoff
consuming from kafka. But I can't stop and restart the consumer since it
needs manual effort.

Backing off few batches is also not possible -since decision of backoff is
based on last batch process status but driver has already computed offsets
for next batches - so if I ignore further few batches till external server
is back to normal its a dataloss if I cannot reset the offset .

So only option seems is to delay the last batch by calling sleep() in
foreach rdd method after returning from foreachpartitions transformations.

So concern here is further batches will keep enqueening until current slept
batch completes. So whats the max size of scheduling queue? Say if server
does not come up for hours and my batch size is 5 sec it will enqueue 720
batches .
Will that be a issue ?
 And is there any setting in directkafkastream to enforce not to call
further computes() method after a threshold of scheduling queue size say
(50 batches).Once queue size comes back to less than threshold call compute
and enqueue the next job.

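A sketch of retrying each post inside foreachPartition with a capped backoff (postToServer, the stream name and the constants are placeholders); note that this still blocks the batch, so subsequent batches will queue behind it just the same:

stream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    events.foreach { event =>
      var attempt = 0
      var sent = false
      while (!sent) {
        try {
          postToServer(event)   // hypothetical call to the external service
          sent = true
        } catch {
          case _: java.io.IOException =>
            attempt += 1
            // capped exponential backoff: 1s, 2s, 4s ... up to 60s
            Thread.sleep(math.min(60000L, 1000L * (1L << math.min(attempt, 6))))
        }
      }
    }
  }
}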




On Tue, Sep 1, 2015 at 8:57 PM, Cody Koeninger  wrote:

> Honestly I'd concentrate more on getting your batches to finish in a
> timely fashion, so you won't even have the issue to begin with...
>
> On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora  > wrote:
>
>> What if I use custom checkpointing. So that I can take care of offsets
>> being checkpointed at end of each batch.
>>
>> Will it be possible then to reset the offset.
>>
>> On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger 
>> wrote:
>>
>>> No, if you start arbitrarily messing around with offset ranges after
>>> compute is called, things are going to get out of whack.
>>>
>>> e.g. checkpoints are no longer going to correspond to what you're
>>> actually processing
>>>
>>> On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
 can I reset the range based on some condition - before calling
 transformations on the stream.

 Say -
 before calling :
  directKafkaStream.foreachRDD(new Function, Void>() {

 @Override
 public Void call(JavaRDD v1) throws Exception {
 v1.foreachPartition(new  VoidFunction>{
 @Override
 public void call(Iterator t) throws Exception {
 }});}});

 change directKafkaStream's RDD's offset range.(fromOffset).

 I can't do this in compute method since compute would have been called
 at current batch queue time - but condition is set at previous batch run
 time.


 On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger 
 wrote:

> It's at the time compute() gets called, which should be near the time
> the batch should have been queued.
>
> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <
> shushantaror...@gmail.com> wrote:
>
>> Hi
>>
>> In spark streaming 1.3 with kafka- when does driver bring latest
>> offsets of this run - at start of each batch or at time when  batch gets
>> queued ?
>>
>> Say few of my batches take longer time to complete than their batch
>> interval. So some of batches will go in queue. Will driver waits for
>>  queued batches to get started or just brings the latest offsets before
>> they even actually started. And when they start running they will work on
>> old offsets brought at time when they were queued.
>>
>>
>

>>>
>>
>


Error when creating an ALS model in spark

2015-09-01 Thread Madawa Soysa
I'm getting the an error when I try to build an ALS model in spark
standalone. I am new to spark. Any help would be appreciated to resolve
this issue.

Stack Trace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, 192.168.0.171): java.io.EOFException
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:186)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1254)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
Sounds like you'd be better off just failing if the external server is
down, and scripting monitoring / restarting of your job.
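
Something along these lines as an external watchdog would be enough (untested
sketch; the app id, submit command and state parsing are placeholders you'd
adapt, and a real script would also pick up the new application id after a
resubmit):

import subprocess
import time

APP_ID = "application_1441000000000_0001"                 # placeholder
SUBMIT_CMD = ["spark-submit", "--master", "yarn-cluster",
              "--class", "com.example.StreamingJob",      # placeholder class / jar
              "streaming-job.jar"]

def is_running(app_id):
    try:
        report = subprocess.check_output(["yarn", "application", "-status", app_id])
        return "State : RUNNING" in report
    except subprocess.CalledProcessError:
        return False

while True:
    if not is_running(APP_ID):
        subprocess.call(SUBMIT_CMD)   # resubmit the job
    time.sleep(60)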

On Tue, Sep 1, 2015 at 11:19 AM, Shushant Arora 
wrote:

> Since in my app , after processing the events I am posting the events to
> some external server- if external server is down - I want to backoff
> consuming from kafka. But I can't stop and restart the consumer since it
> needs manual effort.
>
> Backing off few batches is also not possible -since decision of backoff is
> based on last batch process status but driver has already computed offsets
> for next batches - so if I ignore further few batches till external server
> is back to normal its a dataloss if I cannot reset the offset .
>
> So only option seems is to delay the last batch by calling sleep() in
> foreach rdd method after returning from foreachpartitions transformations.
>
> So concern here is further batches will keep enqueening until current
> slept batch completes. So whats the max size of scheduling queue? Say if
> server does not come up for hours and my batch size is 5 sec it will
> enqueue 720 batches .
> Will that be a issue ?
>  And is there any setting in directkafkastream to enforce not to call
> further computes() method after a threshold of scheduling queue size say
> (50 batches).Once queue size comes back to less than threshold call compute
> and enqueue the next job.
>
>
>
>
>
> On Tue, Sep 1, 2015 at 8:57 PM, Cody Koeninger  wrote:
>
>> Honestly I'd concentrate more on getting your batches to finish in a
>> timely fashion, so you won't even have the issue to begin with...
>>
>> On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> What if I use custom checkpointing. So that I can take care of offsets
>>> being checkpointed at end of each batch.
>>>
>>> Will it be possible then to reset the offset.
>>>
>>> On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger 
>>> wrote:
>>>
 No, if you start arbitrarily messing around with offset ranges after
 compute is called, things are going to get out of whack.

 e.g. checkpoints are no longer going to correspond to what you're
 actually processing

 On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <
 shushantaror...@gmail.com> wrote:

> can I reset the range based on some condition - before calling
> transformations on the stream.
>
> Say -
> before calling :
>  directKafkaStream.foreachRDD(new Function, Void>() {
>
> @Override
> public Void call(JavaRDD v1) throws Exception {
> v1.foreachPartition(new  VoidFunction>{
> @Override
> public void call(Iterator t) throws Exception {
> }});}});
>
> change directKafkaStream's RDD's offset range.(fromOffset).
>
> I can't do this in compute method since compute would have been called
> at current batch queue time - but condition is set at previous batch run
> time.
>
>
> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger 
> wrote:
>
>> It's at the time compute() gets called, which should be near the time
>> the batch should have been queued.
>>
>> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> In spark streaming 1.3 with kafka- when does driver bring latest
>>> offsets of this run - at start of each batch or at time when  batch gets
>>> queued ?
>>>
>>> Say few of my batches take longer time to complete than their batch
>>> interval. So some of batches will go in queue. Will driver waits for
>>>  queued batches to get started or just brings the latest offsets before
>>> they even actually started. And when they start running they will work 
>>> on
>>> old offsets brought at time when they were queued.
>>>
>>>
>>
>

>>>
>>
>


cached data between jobs

2015-09-01 Thread Eric Walker
Hi,

I'm noticing that a 30 minute job that was initially IO-bound may no longer
be IO-bound during subsequent runs. Is there some kind of between-job caching,
in Spark or in Linux, that outlives jobs and might be making subsequent runs
faster? If so, is there a way to avoid the caching in order to get a better
sense of the worst-case scenario?
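
One thing I have thought about trying between runs, to rule out the Linux page
cache, is the untested sketch below (it needs root and would have to run on
every worker node; the drop_caches trick is from the kernel documentation):

import subprocess

# Untested: flush dirty pages and drop the OS page cache so the next run starts cold.
subprocess.check_call("sync && echo 3 > /proc/sys/vm/drop_caches", shell=True)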

(It's also possible that I've simply changed something that made things
faster.)

Eric


Re: Submitted applications does not run.

2015-09-01 Thread Madawa Soysa
Hi Jeff,

I solved the issue by following the given instructions. Thanks for the help.

Regards,
Madawa.

On 1 September 2015 at 14:12, Jeff Zhang  wrote:

> You need to make yourself able to ssh to localhost without password,
> please check this blog.
>
> http://hortonworks.com/kb/generating-ssh-keys-for-passwordless-login/
>
>
>
> On Tue, Sep 1, 2015 at 4:31 PM, Madawa Soysa 
> wrote:
>
>> I used ./sbin/start-master.sh
>>
>> When I used ./sbin/start-all.sh the start fails. I get the following
>> error.
>>
>> failed to launch org.apache.spark.deploy.master.Master:
>> localhost: ssh: connect to host localhost port 22: Connection refused
>>
>> On 1 September 2015 at 13:41, Jeff Zhang  wrote:
>>
>>> Did you start spark cluster using command sbin/start-all.sh ?
>>> You should have 2 log files under folder if it is single-node cluster.
>>> Like the following
>>>
>>> spark-jzhang-org.apache.spark.deploy.master.Master-1-jzhangMBPr.local.out
>>> spark-jzhang-org.apache.spark.deploy.worker.Worker-1-jzhangMBPr.local.out
>>>
>>>
>>>
>>> On Tue, Sep 1, 2015 at 4:01 PM, Madawa Soysa 
>>> wrote:
>>>
 There are no logs which includes apache.spark.deploy.worker in file
 name in the SPARK_HOME/logs folder.

 On 1 September 2015 at 13:00, Jeff Zhang  wrote:

> This is master log. There's no worker registration info in the log.
> That means the worker may not start properly. Please check the log file
> with apache.spark.deploy.worker in file name.
>
>
>
> On Tue, Sep 1, 2015 at 2:55 PM, Madawa Soysa 
> wrote:
>
>> I cannot see anything abnormal in logs. What would be the reason for
>> not availability of executors?
>>
>> On 1 September 2015 at 12:24, Madawa Soysa 
>> wrote:
>>
>>> Following are the logs available. Please find the attached.
>>>
>>> On 1 September 2015 at 12:18, Jeff Zhang  wrote:
>>>
 It's in SPARK_HOME/logs

 Or you can check the spark web ui. http://[master-machine]:8080


 On Tue, Sep 1, 2015 at 2:44 PM, Madawa Soysa <
 madawa...@cse.mrt.ac.lk> wrote:

> How do I check worker logs? SPARK_HOME/work folder does not exist.
> I am using the spark standalone mode.
>
> On 1 September 2015 at 12:05, Jeff Zhang  wrote:
>
>> No executors ? Please check the worker logs if you are using
>> spark standalone mode.
>>
>> On Tue, Sep 1, 2015 at 2:17 PM, Madawa Soysa <
>> madawa...@cse.mrt.ac.lk> wrote:
>>
>>> Hi All,
>>>
>>> I have successfully submitted some jobs to spark master. But the
>>> jobs won't progress and not finishing. Please see the attached 
>>> screenshot.
>>> These are fairly very small jobs and this shouldn't take more than 
>>> a minute
>>> to finish.
>>>
>>> I'm new to spark and any help would be appreciated.
>>>
>>> Thanks,
>>> Madawa.
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
> *_**Madawa Soysa*
>
> Undergraduate,
>
> Department of Computer Science and Engineering,
>
> University of Moratuwa.
>
>
> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
> madawa...@cse.mrt.ac.lk
> LinkedIn  | Twitter
>  | Tumblr
> 
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>
>>> *_**Madawa Soysa*
>>>
>>> Undergraduate,
>>>
>>> Department of Computer Science and Engineering,
>>>
>>> University of Moratuwa.
>>>
>>>
>>> Mobile: +94 71 461 6050 <%2B94%2075%20812%200726> | Email:
>>> madawa...@cse.mrt.ac.lk
>>> LinkedIn  | Twitter
>>>  | Tumblr
>>> 
>>>
>>
>>
>>
>> --
>>
>> *_**Madawa Soysa*
>>
>> Undergraduate,
>>
>> Department of Computer Science and Engineering,
>>
>> University of 

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-01 Thread Anders Arpteg
A fix submitted less than one hour after my mail, very impressive Davies!
I've compiled your PR and tested it with the large job that failed before,
and it seems to work fine now without any exceptions. Awesome, thanks!

Best,
Anders

On Tue, Sep 1, 2015 at 1:38 AM Davies Liu  wrote:

> I had sent out a PR [1] to fix 2), could you help to test that?
>
> [1]  https://github.com/apache/spark/pull/8543
>
> On Mon, Aug 31, 2015 at 12:34 PM, Anders Arpteg 
> wrote:
> > Was trying out 1.5 rc2 and noticed some issues with the Tungsten shuffle
> > manager. One problem was when using the com.databricks.spark.avro reader
> and
> > the error(1) was received, see stack trace below. The problem does not
> occur
> > with the "sort" shuffle manager.
> >
> > Another problem was in a large complex job with lots of transformations
> > occurring simultaneously, i.e. 50+ or more maps each shuffling data.
> > Received error(2) about inability to acquire memory which seems to also
> have
> > to do with Tungsten. Possibly some setting available to increase that
> > memory, because there's lots of heap memory available.
> >
> > Am running on Yarn 2.2 with about 400 executors. Hoping this will give
> some
> > hints for improving the upcoming release, or for me to get some hints to
> fix
> > the problems.
> >
> > Thanks,
> > Anders
> >
> > Error(1)
> >
> > 15/08/31 18:30:57 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
> 3387,
> > lon4-hadoopslave-c245.lon4.spotify.net): java.io.EOFException
> >
> >at java.io.DataInputStream.readInt(DataInputStream.java:392)
> >
> >at
> >
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:121)
> >
> >at
> >
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:109)
> >
> >at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
> >
> >at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >
> >at
> >
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
> >
> >at
> >
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
> >
> >at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >
> >at
> >
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
> >
> >at
> >
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> >
> >at
> >
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$
> 1.org$apache$spark$sql$execution$aggregate$Tung
> >
> > stenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> >
> >at
> >
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> >
> >at
> >
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> >
> >at
> >
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:47)
> >
> >at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> >
> >at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> >
> >at
> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> >
> >at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> >
> >at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> >
> >at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> >
> >at org.apache.spark.scheduler.Task.run(Task.scala:88)
> >
> >at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > Error(2)
> >
> > 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID
> > 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException:
> Unable
> > to acquire 67108864 bytes of memory
> >
> >at
> >
> org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385)
> >
> >at
> >
> org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435)
> >
> >at
> >
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
> >
> >at
> >
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174)
> >
> >at
> >
> 

Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
Can I reset the range based on some condition, before calling
transformations on the stream?

Say, before calling:
 directKafkaStream.foreachRDD(new Function<JavaRDD<String>, Void>() {   // element type shown as String for illustration
   @Override
   public Void call(JavaRDD<String> v1) throws Exception {
     v1.foreachPartition(new VoidFunction<Iterator<String>>() {
       @Override
       public void call(Iterator<String> t) throws Exception {
       }
     });
     return null;
   }
 });

change directKafkaStream's RDD's offset range (fromOffset).

I can't do this in the compute method, since compute would already have been
called at the current batch's queue time, but the condition is only set at the
previous batch's run time.


On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger  wrote:

> It's at the time compute() gets called, which should be near the time the
> batch should have been queued.
>
> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora 
> wrote:
>
>> Hi
>>
>> In spark streaming 1.3 with kafka- when does driver bring latest offsets
>> of this run - at start of each batch or at time when  batch gets queued ?
>>
>> Say few of my batches take longer time to complete than their batch
>> interval. So some of batches will go in queue. Will driver waits for
>>  queued batches to get started or just brings the latest offsets before
>> they even actually started. And when they start running they will work on
>> old offsets brought at time when they were queued.
>>
>>
>


Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
No, if you start arbitrarily messing around with offset ranges after
compute is called, things are going to get out of whack.

e.g. checkpoints are no longer going to correspond to what you're actually
processing

On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora 
wrote:

> can I reset the range based on some condition - before calling
> transformations on the stream.
>
> Say -
> before calling :
>  directKafkaStream.foreachRDD(new Function, Void>() {
>
> @Override
> public Void call(JavaRDD v1) throws Exception {
> v1.foreachPartition(new  VoidFunction>{
> @Override
> public void call(Iterator t) throws Exception {
> }});}});
>
> change directKafkaStream's RDD's offset range.(fromOffset).
>
> I can't do this in compute method since compute would have been called at
> current batch queue time - but condition is set at previous batch run time.
>
>
> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger  wrote:
>
>> It's at the time compute() gets called, which should be near the time the
>> batch should have been queued.
>>
>> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora > > wrote:
>>
>>> Hi
>>>
>>> In spark streaming 1.3 with kafka- when does driver bring latest offsets
>>> of this run - at start of each batch or at time when  batch gets queued ?
>>>
>>> Say few of my batches take longer time to complete than their batch
>>> interval. So some of batches will go in queue. Will driver waits for
>>>  queued batches to get started or just brings the latest offsets before
>>> they even actually started. And when they start running they will work on
>>> old offsets brought at time when they were queued.
>>>
>>>
>>
>


Re: Memory-efficient successive calls to repartition()

2015-09-01 Thread Aurélien Bellet

Dear Alexis,

Thanks again for your reply. After reading about checkpointing I have 
modified my sample code as follows:


for i in range(1000):
    print i
    data2 = data.repartition(50).cache()
    if (i + 1) % 10 == 0:
        data2.checkpoint()
    data2.first()      # materialize rdd
    data.unpersist()   # unpersist previous version
    data = data2

The data is checkpointed every 10 iterations to a directory that I 
specified. While this seems to improve things a little bit, there is 
still a lot of writing on disk (appcache directory, shown as "non HDFS 
files" in Cloudera Manager) *besides* the checkpoint files (which are 
regular HDFS files), and the application eventually runs out of disk 
space. The same is true even if I checkpoint at every iteration.


What am I doing wrong? Maybe some garbage collector setting?
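
In case it is a driver-side GC issue, the variant I plan to try next is below
(the checkpoint directory is a placeholder, parseLine is the same function as
in my original message, and the explicit GC calls are only a guess to let the
ContextCleaner remove the shuffle files of the RDDs I unpersist):

import gc

sc.setCheckpointDir("hdfs:///user/aurelien/checkpoints")   # placeholder path

data = sc.textFile("ImageNet_gist_train.txt", 50).map(parseLine).cache()
data.count()

for i in range(1000):
    data2 = data.repartition(50).cache()
    if (i + 1) % 10 == 0:
        data2.checkpoint()
    data2.first()        # materialize, so the checkpoint actually gets written
    data.unpersist()     # drop the previous cached copy
    data = data2
    if (i + 1) % 10 == 0:
        gc.collect()             # guess: free the old Python RDD wrappers
        sc._jvm.System.gc()      # guess: JVM-side GC so the ContextCleaner can
                                 # delete old shuffle files (_jvm is an internal handle)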

Thanks a lot for the help,

Aurelien

On 24/08/2015 10:39, alexis GILLAIN wrote:

Hi Aurelien,

The first code should create a new RDD in memory at each iteration
(check the webui).
The second code will unpersist the RDD but that's not the main problem.

I think you are having trouble due to the long lineage, as .cache() keeps
track of lineage for recovery.
You should have a look at checkpointing :
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointer.scala

You can also have a look at the code of other iterative algorithms in
mllib for best practices.

2015-08-20 17:26 GMT+08:00 abellet >:

Hello,

For the need of my application, I need to periodically "shuffle" the
data
across nodes/partitions of a reasonably-large dataset. This is an
expensive
operation but I only need to do it every now and then. However it
seems that
I am doing something wrong because as the iterations go the memory usage
increases, causing the job to spill onto HDFS, which eventually gets
full. I
am also getting some "Lost executor" errors that I don't get if I don't
repartition.

Here's a basic piece of code which reproduces the problem:

data = sc.textFile("ImageNet_gist_train.txt",50).map(parseLine).cache()
data.count()
for i in range(1000):
 data=data.repartition(50).persist()
 # below several operations are done on data


What am I doing wrong? I tried the following but it doesn't solve
the issue:

for i in range(1000):
 data2=data.repartition(50).persist()
 data2.count() # materialize rdd
 data.unpersist() # unpersist previous version
 data=data2


Help and suggestions on this would be greatly appreciated! Thanks a lot!




--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Memory-efficient-successive-calls-to-repartition-tp24358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming 1.3 with kafka

2015-09-01 Thread Shushant Arora
What if I use custom checkpointing. So that I can take care of offsets
being checkpointed at end of each batch.

Will it be possible then to reset the offset.

On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger  wrote:

> No, if you start arbitrarily messing around with offset ranges after
> compute is called, things are going to get out of whack.
>
> e.g. checkpoints are no longer going to correspond to what you're actually
> processing
>
> On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora  > wrote:
>
>> can I reset the range based on some condition - before calling
>> transformations on the stream.
>>
>> Say -
>> before calling :
>>  directKafkaStream.foreachRDD(new Function, Void>() {
>>
>> @Override
>> public Void call(JavaRDD v1) throws Exception {
>> v1.foreachPartition(new  VoidFunction>{
>> @Override
>> public void call(Iterator t) throws Exception {
>> }});}});
>>
>> change directKafkaStream's RDD's offset range.(fromOffset).
>>
>> I can't do this in compute method since compute would have been called at
>> current batch queue time - but condition is set at previous batch run time.
>>
>>
>> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger 
>> wrote:
>>
>>> It's at the time compute() gets called, which should be near the time
>>> the batch should have been queued.
>>>
>>> On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
 Hi

 In spark streaming 1.3 with kafka- when does driver bring latest
 offsets of this run - at start of each batch or at time when  batch gets
 queued ?

 Say few of my batches take longer time to complete than their batch
 interval. So some of batches will go in queue. Will driver waits for
  queued batches to get started or just brings the latest offsets before
 they even actually started. And when they start running they will work on
 old offsets brought at time when they were queued.


>>>
>>
>


Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
It's at the time compute() gets called, which should be near the time the
batch should have been queued.

On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora 
wrote:

> Hi
>
> In spark streaming 1.3 with kafka- when does driver bring latest offsets
> of this run - at start of each batch or at time when  batch gets queued ?
>
> Say few of my batches take longer time to complete than their batch
> interval. So some of batches will go in queue. Will driver waits for
>  queued batches to get started or just brings the latest offsets before
> they even actually started. And when they start running they will work on
> old offsets brought at time when they were queued.
>
>


Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Hello everybody,

I am trying to read the Google Books Ngrams with pyspark on Amazon EC2. 

I followed the steps from :
http://spark.apache.org/docs/latest/ec2-scripts.html

and everything is working fine.

I am able to read the file:

lines = sc.textFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")
lines.first()
u'SEQ\x06!org.apache.hadoop.io.LongWritable\x19org.apache.hadoop.io.Text\x01\x01#com.hadoop.compression.lzo.LzoCodec
\x00\x00\x00\x00\ufffd

If I now want to read the file using:

lines = sc.sequenceFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram")

I have the following error message:

15/09/01 15:28:51 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0
(TID 1, 172.31.61.41): java.lang.IllegalArgumentException: Unknown codec:
com.hadoop.compression.lzo.LzoCodec
Traceback (most recent call last):
  File "", line 1, in 
  File "/root/spark/python/pyspark/context.py", line 544, in sequenceFile
keyConverter, valueConverter, minSplits, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.sequenceFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
(TID 4, 172.31.61.41): java.lang.IllegalArgumentException: Unknown codec:
com.hadoop.compression.lzo.LzoCodec


Could you please help me reading the file with pyspark ?

Thank you for your help,

Cheers,

Bertrand



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1-tp24542.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming 1.3 with kafka

2015-09-01 Thread Cody Koeninger
Honestly I'd concentrate more on getting your batches to finish in a timely
fashion, so you won't even have the issue to begin with...

On Tue, Sep 1, 2015 at 10:16 AM, Shushant Arora 
wrote:

> What if I use custom checkpointing. So that I can take care of offsets
> being checkpointed at end of each batch.
>
> Will it be possible then to reset the offset.
>
> On Tue, Sep 1, 2015 at 8:42 PM, Cody Koeninger  wrote:
>
>> No, if you start arbitrarily messing around with offset ranges after
>> compute is called, things are going to get out of whack.
>>
>> e.g. checkpoints are no longer going to correspond to what you're
>> actually processing
>>
>> On Tue, Sep 1, 2015 at 10:04 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> can I reset the range based on some condition - before calling
>>> transformations on the stream.
>>>
>>> Say -
>>> before calling :
>>>  directKafkaStream.foreachRDD(new Function, Void>() {
>>>
>>> @Override
>>> public Void call(JavaRDD v1) throws Exception {
>>> v1.foreachPartition(new  VoidFunction>{
>>> @Override
>>> public void call(Iterator t) throws Exception {
>>> }});}});
>>>
>>> change directKafkaStream's RDD's offset range.(fromOffset).
>>>
>>> I can't do this in compute method since compute would have been called
>>> at current batch queue time - but condition is set at previous batch run
>>> time.
>>>
>>>
>>> On Tue, Sep 1, 2015 at 7:09 PM, Cody Koeninger 
>>> wrote:
>>>
 It's at the time compute() gets called, which should be near the time
 the batch should have been queued.

 On Tue, Sep 1, 2015 at 8:02 AM, Shushant Arora <
 shushantaror...@gmail.com> wrote:

> Hi
>
> In spark streaming 1.3 with kafka- when does driver bring latest
> offsets of this run - at start of each batch or at time when  batch gets
> queued ?
>
> Say few of my batches take longer time to complete than their batch
> interval. So some of batches will go in queue. Will driver waits for
>  queued batches to get started or just brings the latest offsets before
> they even actually started. And when they start running they will work on
> old offsets brought at time when they were queued.
>
>

>>>
>>
>


Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
The web ui is at port 8080. 4040 will show up something when you have a
running job or if you have configured history server.
On Sep 1, 2015 8:57 PM, "Sunil Rathee"  wrote:

>
> Hi,
>
>
> localhost:4040 is not showing anything on the browser. Do we have to start
> some service?
>
> --
>
>
> Sunil Rathee
>
>
>
>


Re: Web UI is not showing up

2015-09-01 Thread Sunil Rathee
localhost:8080 is also not showing anything. Does some application need to be
running at the same time?

On Tue, Sep 1, 2015 at 9:04 PM, Sonal Goyal  wrote:

> The web ui is at port 8080. 4040 will show up something when you have a
> running job or if you have configured history server.
> On Sep 1, 2015 8:57 PM, "Sunil Rathee"  wrote:
>
>>
>> Hi,
>>
>>
>> localhost:4040 is not showing anything on the browser. Do we have to
>> start some service?
>>
>> --
>>
>>
>> Sunil Rathee
>>
>>
>>
>>


-- 


Sunil Rathee


Re: Is it possible to create spark cluster in different network?

2015-09-01 Thread Max Huang
Thanks for replying to my topic.

Yes, I tried both SPARK_MASTER_IP and SPARK_LOCAL_IP in spark-env.sh.

^^

2015-09-01 5:47 GMT-05:00 Akhil Das :

> Did you try with SPARK_LOCAL_IP?
>
> Thanks
> Best Regards
>
> On Tue, Sep 1, 2015 at 12:29 AM, sakana  wrote:
>
>> Hi
>>
>> I have successfully created a Spark cluster in OpenStack.
>> I want to create a Spark cluster across different OpenStack sites.
>>
>> In OpenStack, when you create an instance, it only knows its private IP (like
>> 10.x.y.z); it does not know its own public IP. (I tried to export
>> SPARK_MASTER_IP=xxx.xxx.xxx.xxx with the public IP, but the log shows it
>> can't bind to it.)
>>
>>
>> So port 7077 is listening on the private IP address and refuses connections
>> from other networks' IPs.
>>
>> $ netstat -tupln
>> (Not all processes could be identified, non-owned process info
>>  will not be shown, you would have to be root to see it all.)
>> Active Internet connections (only servers)
>> Proto Recv-Q Send-Q Local Address   Foreign Address State
>> PID/Program name
>> tcp0  0 0.0.0.0:21  0.0.0.0:*
>>  LISTEN
>> -
>> tcp0  0 0.0.0.0:50070   0.0.0.0:*
>>  LISTEN
>> 12921/java
>> tcp0  0 0.0.0.0:22  0.0.0.0:*
>>  LISTEN
>> -
>> tcp0  0 10.103.0.67:90000.0.0.0:*
>>  LISTEN
>> 12921/java
>> tcp0  0 0.0.0.0:50090   0.0.0.0:*
>>  LISTEN
>> 13163/java
>> tcp6   0  0 :::22   :::*LISTEN
>> -
>> tcp6   0  0 10.103.0.67:7077:::*
>> LISTEN
>> 25488/java
>> tcp6   0  0 :::8080 :::*LISTEN
>> 25488/java
>> tcp6   0  0 127.255.255.1:6066  :::*
>> LISTEN
>> 25488/java
>> udp0  0 0.0.0.0:64062   0.0.0.0:*
>> -
>> udp0  0 0.0.0.0:68  0.0.0.0:*
>> -
>> udp6   0  0 :::59067:::*
>>
>> Is it possible to have Spark listen on 0.0.0.0:7077?
>>
>> I want the other slaves to connect to the master via its public IP in
>> OpenStack.
>>
>>
>>
>> Thanks for kindly help
>>
>>
>>
>> Max
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-create-spark-cluster-in-different-network-tp24524.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Web UI is not showing up

2015-09-01 Thread Sunil Rathee
Hi,


localhost:4040 is not showing anything on the browser. Do we have to start
some service?

-- 


Sunil Rathee


Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
Is your master up? Check the java processes to see if they are running.

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier covered in YourStory 
Reifier at Spark Summit 2015






On Tue, Sep 1, 2015 at 9:06 PM, Sunil Rathee 
wrote:

> localhost:8080 is also not showing anything. Does some application running
> at the same time?
>
> On Tue, Sep 1, 2015 at 9:04 PM, Sonal Goyal  wrote:
>
>> The web ui is at port 8080. 4040 will show up something when you have a
>> running job or if you have configured history server.
>> On Sep 1, 2015 8:57 PM, "Sunil Rathee"  wrote:
>>
>>>
>>> Hi,
>>>
>>>
>>> localhost:4040 is not showing anything on the browser. Do we have to
>>> start some service?
>>>
>>> --
>>>
>>>
>>> Sunil Rathee
>>>
>>>
>>>
>>>
>
>
> --
>
>
> Sunil Rathee
>
>
>
>


What should be the optimal value for spark.sql.shuffle.partition?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL (specifically hiveContext.sql()) with group by queries
and I am running into OOM issues. I am thinking of increasing
spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not
helping. Please correct me if I am wrong: these partitions share the shuffle
load, so the more partitions there are, the less data each one has to hold.
Please guide me, I am new to Spark. I am using Spark 1.4.0 and I have around
1TB of uncompressed data to process with hiveContext.sql() group by queries.
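
For reference, this is roughly how I am setting it, with a placeholder table
and query in place of my real ones:

# Raise the number of shuffle partitions before running the group-by query;
# 1000 is just the value I am currently experimenting with.
hiveContext.setConf("spark.sql.shuffle.partitions", "1000")

result = hiveContext.sql(
    "SELECT some_key, COUNT(*) AS cnt FROM my_table GROUP BY some_key")  # placeholder query
result.write.parquet("/tmp/grouped_output")   # placeholder output path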



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-optimal-value-for-spark-sql-shuffle-partition-tp24543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org