No. of Task vs No. of Executors

2015-07-14 Thread shahid
hi 

I have a 10 node cluster. I loaded the data onto HDFS, so the number of
partitions I get is 9. I am running a Spark application and it gets stuck on
one of the tasks. Looking at the UI, it seems the application is not using all
nodes for the calculations. Attached is a screenshot of the tasks; it seems
tasks are placed on each node more than once. Looking at the tasks, 8 tasks
complete within 7-8 minutes and one task takes around 30 minutes, causing the
delay in the results.

 






Re: How does shuffle work in spark ?

2015-10-19 Thread shahid
@all I did partitionBy using the default hash partitioner on data of the form
[(1, data), (2, data), ..., (n, data)].
The total data was approx 3.5; it showed a shuffle write of 50 GB, and on the
next action (e.g. count) it shows a shuffle read of 50 GB. I don't understand
this behaviour, and I think performance is getting slow with so much shuffle
read on the subsequent transformation operations.
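
A minimal sketch of the pattern under discussion (the RDD contents and
partition count below are illustrative, not taken from the job above):
partition once with a known partitioner, persist the result, and reuse it, so
that later operations which preserve the partitioning do not trigger another
full shuffle.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="partition-once")
    pairs = sc.parallelize([(i % 100, "data-%d" % i) for i in range(10000)])

    # One explicit shuffle up front: hash-partition into 20 partitions and keep it.
    partitioned = pairs.partitionBy(20).persist(StorageLevel.MEMORY_AND_DISK)
    print(partitioned.count())                 # materializes the shuffle once

    # mapValues preserves the partitioner, so this reuses the cached,
    # already-partitioned data instead of shuffling the full data set again.
    print(partitioned.mapValues(len).count())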






Can we split a partition

2015-10-20 Thread shahid
Hi 

I have a large (data-skewed) partition that I need to split into a number of
partitions, but repartitioning causes a lot of shuffle. Can we do that?
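
One way to split only the oversized partition without reshuffling everything is
to peel that partition off, repartition just that slice, and union it back. A
sketch, assuming rdd is the existing pair RDD and that partition 4 is the
skewed one (both are illustrative):

    BIG = 4   # index of the skewed partition

    def keep(idx, it, match):
        # Pass elements through only for the selected partition(s).
        return it if (idx == BIG) == match else iter([])

    skewed = rdd.mapPartitionsWithIndex(lambda i, it: keep(i, it, True),
                                        preservesPartitioning=True)
    rest = rdd.mapPartitionsWithIndex(lambda i, it: keep(i, it, False),
                                      preservesPartitioning=True)

    # Only the skewed slice is reshuffled into smaller pieces, then glued back.
    split = skewed.repartition(10).union(rest)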






Python worker exited unexpectedly (crashed)

2015-10-22 Thread shahid
Hi 

I am running a 10 node standalone cluster on AWS and loading 100 GB of data
onto HDFS. I do a first groupBy operation and then generate pairs from the
grouped RDD (key, [a1, b1]), (key, [a, b, c]), generating the pairs like
(a1, b1), (a, b), (a, c) ... n.
The pair RDD will get large in size.
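
One way to keep the pair generation from overwhelming a single Python worker is
to emit the pairs lazily from each group with a generator rather than building
the full list per key in memory. A sketch, where grouped_rdd stands in for the
result of the groupBy:

    from itertools import combinations

    def pairs_for_group(kv):
        key, values = kv
        # The values for one key are already materialized by groupByKey, but the
        # quadratic set of pairs streams out instead of being built as one list.
        for a, b in combinations(values, 2):
            yield (a, b)

    pair_rdd = grouped_rdd.flatMap(pairs_for_group)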

Some stats from the UI when the errors start and before the script finally fails:
Details for Stage 1 (Attempt 0)
Total Time Across All Tasks: 1.3 h
Shuffle Read: 4.4 GB / 1402058
Shuffle Spill (Memory): 73.1 GB
Shuffle Spill (Disk): 3.6 GB

I get the following stack trace:

WARN scheduler.TaskSetManager: Lost task 0.3 in stage 1.0 (TID 943,
10.239.131.154): org.apache.spark.SparkException: Python worker exited
unexpectedly (crashed)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:175)
at
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:111)
... 10 more

15/10/22 16:30:17 ERROR scheduler.TaskSetManager: Task 0 in stage 1.0 failed
4 times; aborting job
15/10/22 16:30:17 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1






DAG info

2015-01-01 Thread shahid
hi guys


I have just started using Spark, and I am getting this as an INFO message:
15/01/02 11:54:17 INFO DAGScheduler: Parents of final stage: List()
15/01/02 11:54:17 INFO DAGScheduler: Missing parents: List()
15/01/02 11:54:17 INFO DAGScheduler: Submitting Stage 6 (PythonRDD[12] at
RDD at PythonRDD.scala:43), which has no missing parents

Also, my program is taking a lot of time to execute.






problem while running code

2015-01-05 Thread shahid
The log is here:
py4j.protocol.Py4JError: An error occurred while calling
o22.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:695)
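
A common cause of this particular Py4JError is a closure that captures a
driver-side wrapper around a JVM object (the SparkContext, a DataFrame, etc.),
which Spark then tries to pickle for the workers. Whether that is what happens
here is an assumption, since the failing code is not shown; a minimal
illustration with an illustrative rdd:

    # Broken: the closure references sc, which cannot be pickled, and fails
    # with "Method __getnewargs__([]) does not exist".
    # bad = rdd.map(lambda x: sc.parallelize([x]).count())

    # Working pattern: keep SparkContext/DataFrame references on the driver and
    # ship only plain Python data into the closure.
    lookup = {"a": 1, "b": 2}                      # plain, picklable data
    good = rdd.map(lambda x: lookup.get(x, 0))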







Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid
INFO scheduler.TaskSetManager: Starting task 2.1 in stage 2.0 (TID 9,
ip-10-80-98-118.ec2.internal, PROCESS_LOCAL, 1025 bytes)
15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 2.0
(TID 6) on executor ip-10-80-15-145.ec2.internal:
org.apache.spark.SparkException (Data of type java.util.ArrayList cannot be
used) [duplicate 1]
15/02/10 15:54:08 INFO scheduler.TaskSetManager: Starting task 1.1 in stage
2.0 (TID 10, ip-10-80-15-145.ec2.internal, PROCESS_LOCAL, 1025 bytes)






getting this error while running

2015-02-28 Thread shahid

conf = SparkConf().setAppName("spark_calc3merged").setMaster("spark://ec2-54-145-68-13.compute-1.amazonaws.com:7077")
sc = SparkContext(conf=conf, pyFiles=["/root/platinum.py", "/root/collections2.py"])
  
15/02/28 19:06:38 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3.0
(TID 38, ip-10-80-15-145.ec2.internal):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 73065
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:156)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/02/28 19:06:38 INFO scheduler.TaskSetManager: Starting task 5.1 in stage
3.0 (TID 44, ip-10-80-15-145.ec2.internal, NODE_LOCAL, 1502 bytes)
15/02/28 19:06:38 INFO scheduler.TaskSetManager: Finished task 8.0 in stage
3.0 (TID 41) in 7040 ms on ip-10-80-98-118.ec2.internal (9/11)
15/02/28 19:06:38 INFO scheduler.TaskSetManager: Finished task 9.0 in stage
3.0 (TID 42) in 7847 ms on ip-10-80-15-145.ec2.internal (10/11)
15/02/28 19:06:50 WARN scheduler.TaskSetManager: Lost task 5.1 in stage 3.0
(TID 44, ip-10-80-15-145.ec2.internal):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 73065
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:156)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/02/28 19:06:50 INFO scheduler.TaskSetManager: Starting task 5.2 in stage
3.0 (TID 45, ip-10-80-98-118.ec2.internal, NODE_LOCAL, 1502 bytes)
15/02/28 19:07:01 WARN scheduler.TaskSetManager: Lost task 5.2 in stage 3.0
(TID 45, ip-10-80-98-118.ec2.internal):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 73065
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
   
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:156)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   

Re: getting this error while running

2015-02-28 Thread shahid
Also, the data file is on HDFS.
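
The "Buffer overflow. Available: 0, required: ..." message usually means the
Kryo serialization buffer is too small for one of the serialized objects. A
sketch of raising it when building the SparkConf; the property name depends on
the release (spark.kryoserializer.buffer.max in newer 1.x versions,
spark.kryoserializer.buffer.max.mb in older ones), and the 256m value is just
an example:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("spark_calc3merged")
            .setMaster("spark://ec2-54-145-68-13.compute-1.amazonaws.com:7077")
            .set("spark.kryoserializer.buffer.max", "256m"))
    sc = SparkContext(conf=conf,
                      pyFiles=["/root/platinum.py", "/root/collections2.py"])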






Re: Research ideas using spark

2015-07-15 Thread shahid ashraf
Sorry Guys!

I mistakenly added my question to this thread (Research ideas using spark).
Moreover, people can ask any question; this Spark user group is for that.

Cheers!
😊

On Wed, Jul 15, 2015 at 9:43 PM, Robin East  wrote:

> Well said Will. I would add that you might want to investigate GraphChi
> which claims to be able to run a number of large-scale graph processing
> tasks on a workstation much quicker than a very large Hadoop cluster. It
> would be interesting to know how widely applicable the approach GraphChi
> takes and what implications it has for parallel/distributed computing
> approaches. A rich seam to mine indeed.
>
> Robin
>
> On 15 Jul 2015, at 14:48, William Temperley 
> wrote:
>
> There seems to be a bit of confusion here - the OP (doing the PhD) had the
> thread hijacked by someone with a similar name asking a mundane question.
>
> It would be a shame to send someone away so rudely, who may do valuable
> work on Spark.
>
> Sashidar (not Sashid!) I'm personally interested in running graph
> algorithms for image segmentation using MLib and Spark.  I've got many
> questions though - like is it even going to give me a speed-up?  (
> http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
>
> It's not obvious to me which classes of graph algorithms can be
> implemented correctly and efficiently in a highly parallel manner.  There's
> tons of work to be done here, I'm sure. Also, look at parallel geospatial
> algorithms - there's a lot of work being done on this.
>
> Best, Will
>
>
>
> On 15 July 2015 at 09:01, Vineel Yalamarthy 
> wrote:
>
>> Hi Daniel
>>
>> Well said
>>
>> Regards
>> Vineel
>>
>> On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> Hi Shahid,
>>> To be honest I think this question is better suited for Stack Overflow
>>> than for a PhD thesis.
>>>
>>> On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf 
>>> wrote:
>>>
>>>> hi
>>>>
>>>> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
>>>> partitions i get is 9. I am running a spark application , it gets stuck on
>>>> one of tasks, looking at the UI it seems application is not using all nodes
>>>> to do calculations. attached is the screen shot of tasks, it seems tasks
>>>> are put on each node more then once. looking at tasks 8 tasks get completed
>>>> under 7-8 minutes and one task takes around 30 minutes so causing the delay
>>>> in results.
>>>>
>>>>
>>>> On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao <
>>>> raoshashidhar...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am doing my PHD thesis on large scale machine learning e.g  Online
>>>>> learning, batch and mini batch learning.
>>>>>
>>>>> Could somebody help me with ideas especially in the context of Spark
>>>>> and to the above learning methods.
>>>>>
>>>>> Some ideas like improvement to existing algorithms, implementing new
>>>>> features especially the above learning methods and algorithms that have 
>>>>> not
>>>>> been implemented etc.
>>>>>
>>>>> If somebody could help me with some ideas it would really accelerate
>>>>> my work.
>>>>>
>>>>> Plus few ideas on research papers regarding Spark or Mahout.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Regards
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> with Regards
>>>> Shahid Ashraf
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>
>>>
>
>


-- 
with Regards
Shahid Ashraf


Re: No. of Task vs No. of Executors

2015-07-21 Thread shahid ashraf
Thanks All!

thanks Ayan!

I did a repartition to 20, so it used all cores in the cluster and the job was
done in 3 minutes. It seems the data was skewed to that one partition.
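
For reference, a sketch of the two knobs discussed in this thread, with
illustrative values and an illustrative HDFS path: spreading the input over
more partitions, and speculative execution so an unusually slow task is
re-launched on another node.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.speculation", "true")           # re-launch slow tasks
            .set("spark.speculation.multiplier", "2"))  # "slow" = 2x the median
    sc = SparkContext(conf=conf)

    # Spread the 9 HDFS partitions across all cores before the heavy stage.
    data = sc.textFile("hdfs:///path/to/data").repartition(20)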



On Tue, Jul 14, 2015 at 8:05 PM, ayan guha  wrote:

> Hi
>
> As you can see, Spark has taken data locality into consideration and thus
> scheduled all tasks as node local. It is because spark could run task on a
> node where data is present, so spark went ahead and scheduled the tasks. It
> is actually good for reading. If you really want to fan out processing, you
> may do a repartition(n).
> Regarding slowness, as you can see another task has completed successfully
> in 6 mins in Executor id 2. So it does not seem that the node itself is slow. It
> is possible the computation for one node is skewed. You may want to switch
> on speculative execution to see if the same task gets completed on another
> node faster or not. If yes, then it's a node issue; else, most likely a data
> issue
>
> On Tue, Jul 14, 2015 at 11:43 PM, shahid  wrote:
>
>> hi
>>
>> I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
>> partitions i get is 9. I am running a spark application , it gets stuck on
>> one of tasks, looking at the UI it seems application is not using all
>> nodes
>> to do calculations. attached is the screen shot of tasks, it seems tasks
>> are
>> put on each node more then once. looking at tasks 8 tasks get completed
>> under 7-8 minutes and one task takes around 30 minutes so causing the
>> delay
>> in results.
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png
>> >
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-tp23824.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -----
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
with Regards
Shahid Ashraf


Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread shahid ashraf
>>>
>>> In addition, the Solr code sets the following additional config
>> parameters on the DefaultHttpClient.
>>
>>   params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128);
>>>   params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32);
>>>   params.set(HttpClientUtil.PROP_FOLLOW_REDIRECTS, followRedirects);
>>
>> Since all my connections are coming out of 2 worker boxes, it looks like
>> I could get 32x2 = 64 clients hitting Solr, right?
>>
>> @Steve: Thanks for the link to the HttpClient config. I was thinking
>> about using a thread pool (or better using a PoolingHttpClientManager per
>> the docs), but it probably won't help since its still being fed one request
>> at a time.
>> @Abhishek: my observations agree with what you said. In the past I have
>> had success with repartition to reduce the partition size especially when
>> groupBy operations were involved. But I believe an executor should be able
>> to handle multiple tasks in parallel from what I understand about Akka on
>> which Spark is built - the worker is essentially an ActorSystem which can
>> contain multiple Actors, each actor works on a queue of tasks. Within an
>> Actor everything is sequential, but the ActorSystem is responsible for
>> farming out tasks it gets to each of its Actors. Although it is possible I
>> could be generalizing incorrectly from my limited experience with Akka.
>>
>> Thanks again for all your help. Please let me know if something jumps out
>> and/or if there is some configuration I should check.
>>
>> -sujit
>>
>>
>>
>> On Sun, Aug 2, 2015 at 6:13 PM, Abhishek R. Singh <
>> abhis...@tetrationanalytics.com> wrote:
>>
>>> I don't know if (your assertion/expectation that) workers will process
>>> things (multiple partitions) in parallel is really valid. Or if having more
>>> partitions than workers will necessarily help (unless you are memory bound
>>> - so partitions is essentially helping your work size rather than execution
>>> parallelism).
>>>
>>> [Disclaimer: I am no authority on Spark, but wanted to throw my spin
>>> based my own understanding].
>>>
>>> Nothing official about it :)
>>>
>>> -abhishek-
>>>
>>> On Jul 31, 2015, at 1:03 PM, Sujit Pal  wrote:
>>>
>>> Hello,
>>>
>>> I am trying to run a Spark job that hits an external webservice to get
>>> back some information. The cluster is 1 master + 4 workers, each worker has
>>> 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server,
>>> and is accessed using code similar to that shown below.
>>>
>>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>>> Iterator[(String, String)] = {
>>>> val solr = new HttpSolrClient()
>>>> initializeSolrParameters(solr)
>>>> keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>>>> }
>>>> myRDD.repartition(10)
>>>
>>>  .mapPartitions(keyValues => getResults(keyValues))
>>>>
>>>
>>> The mapPartitions does some initialization to the SolrJ client per
>>> partition and then hits it for each record in the partition via the
>>> getResults() call.
>>>
>>> I repartitioned in the hope that this will result in 10 clients hitting
>>> Solr simultaneously (I would like to go upto maybe 30-40 simultaneous
>>> clients if I can). However, I counted the number of open connections using
>>> "netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
>>> observed that Solr has a constant 4 clients (ie, equal to the number of
>>> workers) over the lifetime of the run.
>>>
>>> My observation leads me to believe that each worker processes a single
>>> stream of work sequentially. However, from what I understand about how
>>> Spark works, each worker should be able to process number of tasks
>>> parallelly, and that repartition() is a hint for it to do so.
>>>
>>> Is there some SparkConf environment variable I should set to increase
>>> parallelism in these workers, or should I just configure a cluster with
>>> multiple workers per machine? Or is there something I am doing wrong?
>>>
>>> Thank you in advance for any pointers you can provide.
>>>
>>> -sujit
>>>
>>>
>>
>


-- 
with Regards
Shahid Ashraf


Re: Spark ec2 launch problem

2015-08-21 Thread shahid ashraf
Does the cluster work in the end?

On Fri, Aug 21, 2015 at 8:25 PM, Garry Chen  wrote:

> Hi All,
>
> I am trying to launch a Spark EC2 cluster by running
>  spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc
> --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cluster but I keep
> getting the following message endlessly.  Please help.
>
>
>
>
>
> Warning: SSH connection error. (This could be temporary.)
>
> Host:
>
> SSH return code: 255
>
> SSH output: ssh: Could not resolve hostname : Name or service not known
>



-- 
with Regards
Shahid Ashraf


How to efficiently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
I would like to implement the sorted neighborhood approach in Spark. What is the
best way to write that in PySpark?
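
A rough way to express sorted neighborhood in PySpark (a sketch: blocking_key,
records_rdd and the window size are illustrative, and the window below does not
slide across partition boundaries, which a complete implementation would still
need to handle):

    WINDOW = 5

    def window_pairs(records):
        # Records arrive sorted within the partition; compare each record with
        # the previous WINDOW - 1 records, as in sorted neighborhood.
        buf = []
        for rec in records:
            for other in buf:
                yield (other, rec)
            buf.append(rec)
            if len(buf) > WINDOW - 1:
                buf.pop(0)

    sorted_rdd = (records_rdd
                  .map(lambda r: (blocking_key(r), r))
                  .sortByKey()
                  .values())
    candidate_pairs = sorted_rdd.mapPartitions(window_pairs)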



Re: How to efficiently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
Any resources on this

> On Aug 25, 2015, at 3:15 PM, shahid qadri  wrote:
> 
> I would like to implement sorted neighborhood approach in spark, what is the 
> best way to write that in pyspark.





Re: How to efficiently write sorted neighborhood in pyspark

2015-09-01 Thread shahid qadri

> On Aug 25, 2015, at 10:43 PM, shahid qadri  wrote:
> 
> Any resources on this
> 
>> On Aug 25, 2015, at 3:15 PM, shahid qadri  wrote:
>> 
>> I would like to implement sorted neighborhood approach in spark, what is the 
>> best way to write that in pyspark.
> 





Custom Partitioner

2015-09-01 Thread shahid qadri
Hi Sparkians

How can we create a custom partitioner in PySpark?




Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
Hi

I did not get this; e.g. what if I need to create a custom partitioner like a
range partitioner?

On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:

> Hi,
>
> You just need to extend Partitioner and override the numPartitions and
> getPartition methods, see below
>
> class MyPartitioner extends Partitioner {
>   def numPartitions: Int = // Return the number of partitions
>   def getPartition(key: Any): Int = // Return the partition for a given key
> }
>
> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
> wrote:
>
>> Hi Sparkians
>>
>> How can we create a customer partition in pyspark
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


-- 
with Regards
Shahid Ashraf


Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
Hi

I think a range partitioner is not available in PySpark, so if we want to create
one, how should we create it? That is my question.
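
As far as I know there is indeed no ready-made range partitioner for plain RDDs
in PySpark; the usual workaround is to pass your own partition function to
partitionBy. A sketch with illustrative string-key boundaries and an
illustrative pair_rdd:

    NUM_PARTS = 4
    BOUNDARIES = ["g", "n", "t"]   # keys < "g" -> 0, < "n" -> 1, < "t" -> 2, else 3

    def range_partition(key):
        for i, bound in enumerate(BOUNDARIES):
            if key < bound:
                return i
        return len(BOUNDARIES)

    # partitionBy accepts any function from key to partition id.
    partitioned = pair_rdd.partitionBy(NUM_PARTS, range_partition)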

On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker  wrote:

> Ah sorry I miss read your question. In pyspark it looks like you just need
> to instantiate the Partitioner class with numPartitions and partitionFunc.
>
> On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf  wrote:
>
>> Hi
>>
>> I did not get this, e.g if i need to create a custom partitioner like
>> range partitioner.
>>
>> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker  wrote:
>>
>>> Hi,
>>>
>>> You just need to extend Partitioner and override the numPartitions and
>>> getPartition methods, see below
>>>
>>> class MyPartitioner extends Partitioner {
>>>   def numPartitions: Int = // Return the number of partitions
>>>   def getPartition(key: Any): Int = // Return the partition for a given
>>> key
>>> }
>>>
>>> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri 
>>> wrote:
>>>
>>>> Hi Sparkians
>>>>
>>>> How can we create a customer partition in pyspark
>>>>
>>>> -----
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>
>>
>> --
>> with Regards
>> Shahid Ashraf
>>
>


-- 
with Regards
Shahid Ashraf


Re: Custom Partitioner

2015-09-02 Thread shahid ashraf
Yes, I can take that as an example, but my actual use case is that I need to
resolve a data skew. When I do grouping based on key (A-Z), the resulting
partitions are skewed like
(partition no., no. of keys, total elements with the given keys)
<< partition: [(0, 0, 0), (1, 15, 17395), (2, 0, 0), (3, 0, 0), (4, 13,
18196), (5, 0, 0), (6, 0, 0), (7, 0, 0), (8, 1, 1), (9, 0, 0)] and
elements: >>
The data has been skewed to partitions 1 and 4. I need to split those
partitions, do processing on the split partitions, and I should be able to
combine the split partitions back as well.
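
One way to express that split-and-recombine is to salt only the hot keys, so
the grouped work for those keys is spread over several partitions, and then
strip the salt when combining. A sketch in which the hot-key set, the salt
factor and the pair_rdd are all illustrative:

    import random

    HOT_KEYS = {"A", "C"}   # the keys that end up in the skewed partitions
    SALT = 8

    def add_salt(kv):
        key, value = kv
        if key in HOT_KEYS:
            return ((key, random.randint(0, SALT - 1)), value)
        return ((key, 0), value)

    salted = pair_rdd.map(add_salt)
    # Heavy keys are now spread over up to SALT partitions.
    partial = salted.groupByKey().mapValues(list)

    # Strip the salt and merge the partial groups back per original key.
    combined = (partial
                .map(lambda kv: (kv[0][0], kv[1]))
                .reduceByKey(lambda a, b: a + b))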

On Tue, Sep 1, 2015 at 10:42 PM, Davies Liu  wrote:

> You can take the sortByKey as example:
> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642
>
> On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker  wrote:
> > something like...
> >
> > class RangePartitioner(Partitioner):
> >     def __init__(self, numParts):
> >         self.numPartitions = numParts
> >         self.partitionFunc = self.rangePartition
> >     def rangePartition(self, key):
> >         # Logic to turn key into a partition id
> >         return id
> >
> > On Tue, Sep 1, 2015 at 11:38 AM shahid ashraf  wrote:
> >>
> >> Hi
> >>
> >> I think range partitioner is not available in pyspark, so if we want
> >> create one. how should we create that. my question is that.
> >>
> >> On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker 
> wrote:
> >>>
> >>> Ah sorry I miss read your question. In pyspark it looks like you just
> >>> need to instantiate the Partitioner class with numPartitions and
> >>> partitionFunc.
> >>>
> >>> On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf 
> wrote:
> >>>>
> >>>> Hi
> >>>>
> >>>> I did not get this, e.g if i need to create a custom partitioner like
> >>>> range partitioner.
> >>>>
> >>>> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker 
> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> You just need to extend Partitioner and override the numPartitions
> and
> >>>>> getPartition methods, see below
> >>>>>
> >>>>> class MyPartitioner extends Partitioner {
> >>>>>   def numPartitions: Int = // Return the number of partitions
> >>>>>   def getPartition(key: Any): Int = // Return the partition for a
> given
> >>>>> key
> >>>>> }
> >>>>>
> >>>>> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri <
> shahidashr...@icloud.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi Sparkians
> >>>>>>
> >>>>>> How can we create a customer partition in pyspark
> >>>>>>
> >>>>>>
> -
> >>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>>>>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> with Regards
> >>>> Shahid Ashraf
> >>
> >>
> >>
> >>
> >> --
> >> with Regards
> >> Shahid Ashraf
>



-- 
with Regards
Shahid Ashraf


Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread shahid ashraf
current.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/09/02 21:12:43 INFO SparkContext: Invoking stop() from shutdown hook
15/09/02 21:12:43 INFO TaskSchedulerImpl: Cancelling stage 10
15/09/02 21:12:43 INFO Executor: Executor is trying to kill task 4.0 in
stage 10.0 (TID 64)
15/09/02 21:12:43 INFO TaskSchedulerImpl: Stage 10 was cancelled
15/09/02 21:12:43 INFO DAGScheduler: ShuffleMapStage 10 (repartition at
NativeMethodAccessorImpl.java:-2) failed in 102.132 s
15/09/02 21:12:43 INFO DAGScheduler: Job 4 failed: collect at
/Users/shahid/projects/spark_rl/record_linker_spark.py:74, took 102.154710 s
Traceback (most recent call last):
  File "/Users/shahid/projects/spark_rl/record_linker_spark.py", line 121,
in <module>
15/09/02 21:12:43 INFO SparkUI: Stopped Spark web UI at
http://192.168.1.2:4040
15/09/02 21:12:43 INFO DAGScheduler: Stopping DAGScheduler
15/09/02 21:12:43 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/09/02 21:12:43 INFO Utils: path =
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljcgn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e/blockmgr-e8e82c2e-ca87-4667-a330-b1edb83aa81f,
already present as root for deletion.
15/09/02 21:12:43 INFO MemoryStore: MemoryStore cleared
15/09/02 21:12:43 INFO BlockManager: BlockManager stopped
15/09/02 21:12:43 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/02 21:12:43 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
15/09/02 21:12:43 WARN PythonRDD: Incomplete task interrupted: Attempting
to kill Python Worker
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
shut down.
R.print_no_elements(mathes_grp)
  File "/Users/shahid/projects/spark_rl/record_linker_spark.py", line 74,
in print_no_elements
print "\n* * * * * * * * * * * * * * *\n<< partition: %s and elements:
>> \n* * * * * * * * * * * * * * *\n" %
rdd.mapPartitionsWithIndex(self.index_no_elements).collect()
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
line 757, in collect
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage
10.0 (TID 61, localhost): java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:113)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anon
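
If the goal of print_no_elements is only the per-partition element counts, then
counting inside each partition and collecting just the (index, count) pairs is
much lighter on both the executors and the driver than collecting elements; a
sketch (rdd is illustrative):

    def count_partition(idx, it):
        n = 0
        for _ in it:
            n += 1
        yield (idx, n)

    # Only one small (index, count) pair per partition reaches the driver.
    sizes = rdd.mapPartitionsWithIndex(count_partition).collect()
    print(sizes)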

ERROR WHILE REPARTITION

2015-09-02 Thread shahid ashraf
dPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/09/02 21:12:43 INFO SparkContext: Invoking stop() from shutdown hook
15/09/02 21:12:43 INFO TaskSchedulerImpl: Cancelling stage 10
15/09/02 21:12:43 INFO Executor: Executor is trying to kill task 4.0 in
stage 10.0 (TID 64)
15/09/02 21:12:43 INFO TaskSchedulerImpl: Stage 10 was cancelled
15/09/02 21:12:43 INFO DAGScheduler: ShuffleMapStage 10 (repartition at
NativeMethodAccessorImpl.java:-2) failed in 102.132 s
15/09/02 21:12:43 INFO DAGScheduler: Job 4 failed: collect at
/Users/shahid/projects/spark_rl/record_linker_spark.py:74, took 102.154710 s
Traceback (most recent call last):
  File "/Users/shahid/projects/spark_rl/record_linker_spark.py", line 121,
in <module>
15/09/02 21:12:43 INFO SparkUI: Stopped Spark web UI at
http://192.168.1.2:4040
15/09/02 21:12:43 INFO DAGScheduler: Stopping DAGScheduler
15/09/02 21:12:43 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/09/02 21:12:43 INFO Utils: path =
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljcgn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e/blockmgr-e8e82c2e-ca87-4667-a330-b1edb83aa81f,
already present as root for deletion.
15/09/02 21:12:43 INFO MemoryStore: MemoryStore cleared
15/09/02 21:12:43 INFO BlockManager: BlockManager stopped
15/09/02 21:12:43 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/02 21:12:43 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
15/09/02 21:12:43 WARN PythonRDD: Incomplete task interrupted: Attempting
to kill Python Worker
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
shut down.
R.print_no_elements(mathes_grp)
  File "/Users/shahid/projects/spark_rl/record_linker_spark.py", line 74,
in print_no_elements
print "\n* * * * * * * * * * * * * * *\n<< partition: %s and elements:
>> \n* * * * * * * * * * * * * * *\n" %
rdd.mapPartitionsWithIndex(self.index_no_elements).collect()
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
line 757, in collect
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage
10.0 (TID 61, localhost): java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:113)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun

INDEXEDRDD in PYSPARK

2015-09-03 Thread shahid ashraf
Hi Folks

Any resources to get started using https://github.com/amplab/spark-indexedrdd
in PySpark?

-- 
with Regards
Shahid Ashraf


API to run spark Jobs

2015-10-06 Thread shahid qadri
Hi Folks

How can I submit my Spark app (Python) to the cluster without using
spark-submit? Actually, I need to invoke jobs from a UI.



Re: API to run spark Jobs

2015-10-06 Thread shahid qadri
Hi Jeff,
Thanks.
More specifically, I need the REST API to submit a PySpark job. Can you point me
to the spark-submit REST API?
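
A sketch of one simple way to trigger a PySpark job from a web request without
a dedicated job server: have the backend call spark-submit itself. The paths,
master URL and arguments below are illustrative.

    import subprocess

    def submit_job(app_path, app_args):
        # Fire-and-forget spark-submit; keep the handle to read the driver log.
        cmd = ["/opt/spark/bin/spark-submit",
               "--master", "spark://master:7077",
               app_path] + list(app_args)
        return subprocess.Popen(cmd,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)

    proc = submit_job("/jobs/my_app.py", ["--date", "2015-10-06"])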

> On Oct 6, 2015, at 10:25 PM, Jeff Nadler  wrote:
> 
> 
> Spark standalone doesn't come with a UI for submitting jobs.   Some Hadoop 
> distros might, for example EMR in AWS has a job submit UI.
> 
> Spark submit just calls a REST api, you could build any UI you want on top of 
> that...
> 
> 
> On Tue, Oct 6, 2015 at 9:37 AM, shahid qadri  <mailto:shahidashr...@icloud.com>> wrote:
> Hi Folks
> 
> How i can submit my spark app(python) to the cluster without using 
> spark-submit, actually i need to invoke jobs from UI
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Build Failure

2015-10-08 Thread shahid qadri
hi 

I tried to build the latest master branch of Spark:
build/mvn -DskipTests clean package


Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [03:46 min]
[INFO] Spark Project Test Tags  SUCCESS [01:02 min]
[INFO] Spark Project Launcher . SUCCESS [01:03 min]
[INFO] Spark Project Networking ... SUCCESS [ 30.794 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 29.496 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 18.478 s]
[INFO] Spark Project Core . SUCCESS [05:42 min]
[INFO] Spark Project Bagel  SUCCESS [  6.082 s]
[INFO] Spark Project GraphX ... SUCCESS [ 23.478 s]
[INFO] Spark Project Streaming  SUCCESS [ 53.969 s]
[INFO] Spark Project Catalyst . SUCCESS [02:12 min]
[INFO] Spark Project SQL .. SUCCESS [03:02 min]
[INFO] Spark Project ML Library ... SUCCESS [02:57 min]
[INFO] Spark Project Tools  SUCCESS [  3.139 s]
[INFO] Spark Project Hive . SUCCESS [03:25 min]
[INFO] Spark Project REPL . SUCCESS [ 18.303 s]
[INFO] Spark Project Assembly . SUCCESS [01:40 min]
[INFO] Spark Project External Twitter . SUCCESS [ 16.707 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 52.234 s]
[INFO] Spark Project External Flume ... SUCCESS [ 13.069 s]
[INFO] Spark Project External Flume Assembly .. SUCCESS [  4.653 s]
[INFO] Spark Project External MQTT  SUCCESS [01:56 min]
[INFO] Spark Project External MQTT Assembly ... SUCCESS [ 15.233 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 13.267 s]
[INFO] Spark Project External Kafka ... SUCCESS [ 41.663 s]
[INFO] Spark Project Examples . FAILURE [07:36 min]
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 40:07 min
[INFO] Finished at: 2015-10-08T13:14:31+05:30
[INFO] Final Memory: 373M/1205M
[INFO] 
[ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
resolve dependencies for project 
org.apache.spark:spark-examples_2.10:jar:1.6.0-SNAPSHOT: The following 
artifacts could not be resolved: com.twitter:algebird-core_2.10:jar:0.9.0, 
com.github.stephenc:jamm:jar:0.2.5: Could not transfer artifact 
com.twitter:algebird-core_2.10:jar:0.9.0 from/to central 
(https://repo1.maven.org/maven2): GET request of: 
com/twitter/algebird-core_2.10/0.9.0/algebird-core_2.10-0.9.0.jar from central 
failed: Connection reset -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-examples_2.10



Re: Build Failure

2015-10-08 Thread shahid ashraf
Yes, this was a temporary issue. Build success:
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [
10.791 s]
[INFO] Spark Project Test Tags  SUCCESS [
 3.743 s]
[INFO] Spark Project Launcher . SUCCESS [
12.242 s]
[INFO] Spark Project Networking ... SUCCESS [
16.550 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
12.557 s]
[INFO] Spark Project Unsafe ... SUCCESS [
10.005 s]
[INFO] Spark Project Core . SUCCESS [03:12
min]
[INFO] Spark Project Bagel  SUCCESS [
 5.712 s]
[INFO] Spark Project GraphX ... SUCCESS [
20.046 s]
[INFO] Spark Project Streaming  SUCCESS [
51.469 s]
[INFO] Spark Project Catalyst . SUCCESS [01:10
min]
[INFO] Spark Project SQL .. SUCCESS [01:32
min]
[INFO] Spark Project ML Library ... SUCCESS [01:38
min]
[INFO] Spark Project Tools  SUCCESS [
 2.839 s]
[INFO] Spark Project Hive . SUCCESS [01:19
min]
[INFO] Spark Project REPL . SUCCESS [
11.275 s]
[INFO] Spark Project Assembly . SUCCESS [01:26
min]
[INFO] Spark Project External Twitter . SUCCESS [
 9.562 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [
 7.824 s]
[INFO] Spark Project External Flume ... SUCCESS [
11.726 s]
[INFO] Spark Project External Flume Assembly .. SUCCESS [
 4.254 s]
[INFO] Spark Project External MQTT  SUCCESS [
24.255 s]
[INFO] Spark Project External MQTT Assembly ... SUCCESS [
12.513 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [
 9.756 s]
[INFO] Spark Project External Kafka ... SUCCESS [
15.013 s]
[INFO] Spark Project Examples . SUCCESS [02:08
min]
[INFO] Spark Project External Kafka Assembly .. SUCCESS [
 8.617 s]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 16:51 min
[INFO] Finished at: 2015-10-08T13:40:00+05:30
[INFO] Final Memory: 387M/1532M
[INFO]


On Thu, Oct 8, 2015 at 1:30 PM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> I just tried and it works for me (I don't have any Maven mirror on my
> subnet).
>
> Can you try again ? Maybe it was a temporary issue to access to Maven
> central.
>
> The artifact is present on central:
>
> http://repo1.maven.org/maven2/com/twitter/algebird-core_2.10/0.9.0/
>
> Regards
> JB
>
>
> On 10/08/2015 09:55 AM, shahid qadri wrote:
>
>> hi
>>
>> I tried to build latest master branch of spark
>> build/mvn -DskipTests clean package
>>
>>
>> Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM ... SUCCESS
>> [03:46 min]
>> [INFO] Spark Project Test Tags  SUCCESS
>> [01:02 min]
>> [INFO] Spark Project Launcher . SUCCESS
>> [01:03 min]
>> [INFO] Spark Project Networking ... SUCCESS [
>> 30.794 s]
>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>> 29.496 s]
>> [INFO] Spark Project Unsafe ... SUCCESS [
>> 18.478 s]
>> [INFO] Spark Project Core . SUCCESS
>> [05:42 min]
>> [INFO] Spark Project Bagel  SUCCESS [
>> 6.082 s]
>> [INFO] Spark Project GraphX ... SUCCESS [
>> 23.478 s]
>> [INFO] Spark Project Streaming  SUCCESS [
>> 53.969 s]
>> [INFO] Spark Project Catalyst . SUCCESS
>> [02:12 min]
>> [INFO] Spark Project SQL .. SUCCESS
>> [03:02 min]
>> [INFO] Spark Project ML Library ... SUCCESS
>> [02:57 min]
>> [INFO] Spark Project Tools  SUCCESS [
>> 3.139 s]
>> [INFO] Spark Project Hive . SUCCESS
>> [03:25 min]
>> [INFO] Spark Project REPL . SUCCESS [
>> 18.303 s]
>> [INFO] Spark Project Assembly . SUCCESS
>> [01:40 min]
>> [INFO] Spark Project External Twitter . 

repartition vs partitionBy

2015-10-17 Thread shahid qadri
Hi folks

I need to repartition a large set of data, around 300 GB, as I see some portions
have large data (data skew).

I have pairRDDs [({},{}),({},{}),({},{})].

What is the best way to solve the problem?



Re: repartition vs partitionBy

2015-10-17 Thread shahid ashraf
Yes, I know about that; it is for the case of reducing the number of partitions.
The point here is that the data is skewed to a few partitions.


On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> You can use coalesce function, if you want to reduce the number of
> partitions. This one minimizes the data shuffle.
>
> -Raghav
>
> On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri 
> wrote:
>
>> Hi folks
>>
>> I need to reparation large set of data around(300G) as i see some
>> portions have large data(data skew)
>>
>> i have pairRDDs [({},{}),({},{}),({},{})]
>>
>> what is the best way to solve the the problem
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
with Regards
Shahid Ashraf


Re: repartition vs partitionBy

2015-10-18 Thread shahid ashraf
Yes, I am trying to do that, but it will try to repartition the whole data.
Can't we split a large (data-skewed) partition into multiple partitions? Any
ideas on this?

On Sun, Oct 18, 2015 at 1:55 AM, Adrian Tanase  wrote:

> If the dataset allows it you can try to write a custom partitioner to help
> spark distribute the data more uniformly.
>
> Sent from my iPhone
>
> On 17 Oct 2015, at 16:14, shahid ashraf  wrote:
>
> yes i know about that,its in case to reduce partitions. the point here is
> the data is skewed to few partitions..
>
>
> On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> You can use coalesce function, if you want to reduce the number of
>> partitions. This one minimizes the data shuffle.
>>
>> -Raghav
>>
>> On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri 
>> wrote:
>>
>>> Hi folks
>>>
>>> I need to reparation large set of data around(300G) as i see some
>>> portions have large data(data skew)
>>>
>>> i have pairRDDs [({},{}),({},{}),({},{})]
>>>
>>> what is the best way to solve the the problem
>>> -----
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> with Regards
> Shahid Ashraf
>
>


-- 
with Regards
Shahid Ashraf


LARGE COLLECT

2015-10-26 Thread shahid qadri
Hi Folks 

This might not sound appropriate, but I want to collect large data (approx. 15 GB),
do some processing on the driver, and broadcast that back to each node. Is
there any option to collect the data off-heap, like we can store an RDD off-heap?

i.e. sort of collect the data on the Tachyon FS...
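
If the driver-side processing can work on a stream of records rather than one
giant in-memory list, toLocalIterator() pulls the data over one partition at a
time, and the (hopefully much smaller) processed result can then be broadcast
back. A sketch in which big_rdd and process_record stand in for the real data
and the driver-side work:

    summary = {}
    # Pulls one partition at a time instead of materializing ~15 GB at once.
    for key, value in big_rdd.toLocalIterator():
        summary[key] = process_record(summary.get(key), value)

    # Ship the (smaller) driver-side result back to every executor.
    summary_bc = sc.broadcast(summary)
    enriched = big_rdd.map(lambda kv: (kv[0], summary_bc.value.get(kv[0])))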



Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-10-29 Thread shahid ashraf
Hi,
I guess you need to increase the Spark driver memory as well, but that should
be set in the conf files.
Let me know if that resolves it.
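
One more thing worth checking: spark.driver.maxResultSize has to be set before
the context is created (or passed on the command line); setting it through
getConf() on a live context has no effect, because getConf() returns a copy.
The sketch below is PySpark, but the same ordering applies to the Java
streaming context in this thread; it reuses the 2g value from the question.

    # On the command line:
    #   spark-submit --conf spark.driver.maxResultSize=2g --driver-memory 20g ...

    # Or programmatically, before the context exists:
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.driver.maxResultSize", "2g")
    sc = SparkContext(conf=conf)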
On Oct 30, 2015 7:33 AM, "karthik kadiyam" 
wrote:

> Hi,
>
> In spark streaming job i had the following setting
>
> this.jsc.getConf().set("spark.driver.maxResultSize", “0”);
> and i got the error in the job as below
>
> User class threw exception: Job aborted due to stage failure: Total size
> of serialized results of 120 tasks (1082.2 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
>
> Basically i realized that as default value is 1 GB. I changed
> the configuration as below.
>
> this.jsc.getConf().set("spark.driver.maxResultSize", “2g”);
>
> and when i ran the job it gave the error
>
> User class threw exception: Job aborted due to stage failure: Total size
> of serialized results of 120 tasks (1082.2 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
>
> So, basically the change i made is not been considered in the job. so my
> question is
>
> - "spark.driver.maxResultSize", “2g” is this the right way to change or
> any other way to do it.
> - Is this a bug in spark 1.3 or something or any one had this issue
> before?
>
>


Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-11-01 Thread shahid ashraf
Is your process getting killed?
If yes, then try to check with dmesg.

On Mon, Nov 2, 2015 at 8:17 AM, karthik kadiyam <
karthik.kadiyam...@gmail.com> wrote:

> Did any one had issue setting spark.driver.maxResultSize value ?
>
> On Friday, October 30, 2015, karthik kadiyam 
> wrote:
>
>> Hi Shahid,
>>
>> I played around with spark driver memory too. In the conf file it was set
>> to " --driver-memory 20G " first. When i changed the spark driver
>> maxResultSize from default to 2g ,i changed the driver memory to 30G and
>> tired too. It gave we same error says "bigger than  (1024.0 MB) " .
>> spark.driver.maxResultSize
>> One other thing i observed is , in one of the tasks the data its trying
>> to process is more than 100 MB and that exceutor and task keeps losing
>> connection and doing retry. I tried increase the Tasks by repartition from
>> 120 to 240 to 480 also. Still i can see in one of my tasks it still is
>> trying to process more than 100 mb. Other task hardly process 1 mb to 10 mb
>> , some around 20 mbs, some have 0 mbs .
>>
>> Any idea how can i try to even the data distribution acrosss multiple
>> node.
>>
>> On Fri, Oct 30, 2015 at 12:09 AM, shahid ashraf 
>> wrote:
>>
>>> Hi
>>> I guess you need to increase spark driver memory as well. But that
>>> should be set in conf files
>>> Let me know if that resolves
>>> On Oct 30, 2015 7:33 AM, "karthik kadiyam" 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> In spark streaming job i had the following setting
>>>>
>>>> this.jsc.getConf().set("spark.driver.maxResultSize", “0”);
>>>> and i got the error in the job as below
>>>>
>>>> User class threw exception: Job aborted due to stage failure: Total
>>>> size of serialized results of 120 tasks (1082.2 MB) is bigger than
>>>> spark.driver.maxResultSize (1024.0 MB)
>>>>
>>>> Basically i realized that as default value is 1 GB. I changed
>>>> the configuration as below.
>>>>
>>>> this.jsc.getConf().set("spark.driver.maxResultSize", “2g”);
>>>>
>>>> and when i ran the job it gave the error
>>>>
>>>> User class threw exception: Job aborted due to stage failure: Total
>>>> size of serialized results of 120 tasks (1082.2 MB) is bigger than
>>>> spark.driver.maxResultSize (1024.0 MB)
>>>>
>>>> So, basically the change i made is not been considered in the job. so
>>>> my question is
>>>>
>>>> - "spark.driver.maxResultSize", “2g” is this the right way to change
>>>> or any other way to do it.
>>>> - Is this a bug in spark 1.3 or something or any one had this issue
>>>> before?
>>>>
>>>>
>>


-- 
with Regards
Shahid Ashraf


HDFS

2015-12-11 Thread shahid ashraf
Hi folks,

I am using a standalone cluster of 50 servers on AWS. I loaded data onto HDFS;
why am I getting Locality Level ANY for data on HDFS? I have 900+
partitions.


-- 
with Regards
Shahid Ashraf


RE: Code review - Spark SQL command-line client for Cassandra

2015-06-20 Thread shahid ashraf
Hi Mohammed,
Can you provide more info about the service you developed?
On Jun 20, 2015 7:59 AM, "Mohammed Guller"  wrote:

>  Hi Matthew,
>
> It looks fine to me. I have built a similar service that allows a user to
> submit a query from a browser and returns the result in JSON format.
>
>
>
> Another alternative is to leave a Spark shell or one of the notebooks
> (Spark Notebook, Zeppelin, etc.) session open and run queries from there.
> This model works only if people give you the queries to execute.
>
>
>
> Mohammed
>
>
>
> *From:* Matthew Johnson [mailto:matt.john...@algomi.com]
> *Sent:* Friday, June 19, 2015 2:20 AM
> *To:* user@spark.apache.org
> *Subject:* Code review - Spark SQL command-line client for Cassandra
>
>
>
> Hi all,
>
>
>
> I have been struggling with Cassandra’s lack of adhoc query support (I
> know this is an anti-pattern of Cassandra, but sometimes management come
> over and ask me to run stuff and it’s impossible to explain that it will
> take me a while when it would take about 10 seconds in MySQL) so I have put
> together the following code snippet that bundles DataStax’s Cassandra Spark
> connector and allows you to submit Spark SQL to it, outputting the results
> in a text file.
>
>
>
> Does anyone spot any obvious flaws in this plan?? (I have a lot more error
> handling etc in my code, but removed it here for brevity)
>
>
>
> private void run(String sqlQuery) {
>     SparkContext scc = new SparkContext(conf);
>     CassandraSQLContext csql = new CassandraSQLContext(scc);
>     DataFrame sql = csql.sql(sqlQuery);
>     String folderName = "/tmp/output_" + System.currentTimeMillis();
>     LOG.info("Attempting to save SQL results in folder: " + folderName);
>     sql.rdd().saveAsTextFile(folderName);
>     LOG.info("SQL results saved");
> }
>
> public static void main(String[] args) {
>
>     String sparkMasterUrl = args[0];
>     String sparkHost = args[1];
>     String sqlQuery = args[2];
>
>     SparkConf conf = new SparkConf();
>     conf.setAppName("Java Spark SQL");
>     conf.setMaster(sparkMasterUrl);
>     conf.set("spark.cassandra.connection.host", sparkHost);
>
>     JavaSparkSQL app = new JavaSparkSQL(conf);
>
>     app.run(sqlQuery);
> }
>
>
>
> I can then submit this to Spark with ‘spark-submit’:
>
>
>
> ./spark-submit --class com.algomi.spark.JavaSparkSQL --master
> spark://sales3:7077
> spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar
> spark://sales3:7077 sales3 "select * from mykeyspace.operationlog"
>
>
>
> It seems to work pretty well, so I’m pretty happy, but wondering why this
> isn’t common practice (at least I haven’t been able to find much about it
> on Google) – is there something terrible that I’m missing?
>
>
>
> Thanks!
>
> Matthew
>
>
>
>
>


Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread shahid ashraf
hi Folks

I am a newbie to the Spark world; this seems like very interesting work, as
does the discussion. I have the same sort of use case. I usually use MySQL
for blocking queries in record linkage, but the data has now grown so much
that it no longer scales. I want to store all my data on HDFS and expose it
via Spark SQL, but I see it is usually executed as a batch process. I want to
expose Spark SQL as a sort of API that accepts plain SQL queries, not as
batch processing. Please provide the necessary guidance.
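
For illustration only, here is a minimal sketch of that idea in PySpark: one long-lived SQLContext behind a small HTTP endpoint. Flask, the route name, and the port are assumptions made for the example; this is not Mohammed's Play-based service, and collect() is only reasonable for small result sets.

# Sketch only: exposing Spark SQL as a simple HTTP API instead of a batch job.
# Assumes the tables to query have already been registered (registerTempTable or Hive).
from flask import Flask, request, jsonify
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("spark-sql-api")
sc = SparkContext(conf=conf)        # one shared context for the lifetime of the service
sqlContext = SQLContext(sc)

app = Flask(__name__)

@app.route("/sql", methods=["POST"])
def run_sql():
    query = request.get_data().decode("utf-8")      # raw SQL in the request body
    rows = sqlContext.sql(query).collect()          # only safe for small result sets
    return jsonify({"rows": [row.asDict() for row in rows]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8090)

The same pattern works with a HiveContext if the data lives in Hive tables on HDFS.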

On Mon, Jun 22, 2015 at 10:22 PM, Matthew Johnson 
wrote:

> Hi Pawan,
>
>
>
> Looking at the changes for that git pull request, it looks like it just
> pulls in the dependency (and transitives) for “spark-cassandra-connector”.
> Since I am having to build Zeppelin myself anyway, would it be ok to just
> add this myself for the connector for 1.4.0 (as found here
> http://search.maven.org/#artifactdetails%7Ccom.datastax.spark%7Cspark-cassandra-connector_2.11%7C1.4.0-M1%7Cjar)?
> What exactly is it that does not currently exist for Spark 1.4?
>
>
>
> Thanks,
>
> Matthew
>
>
>
> *From:* pawan kumar [mailto:pkv...@gmail.com]
> *Sent:* 22 June 2015 17:19
> *To:* Silvio Fiorito
> *Cc:* Mohammed Guller; Matthew Johnson; shahid ashraf;
> user@spark.apache.org
> *Subject:* Re: Code review - Spark SQL command-line client for Cassandra
>
>
>
> Hi,
>
>
>
> Zeppelin has a cassandra-spark-connector built into the build. I have not
> tried it yet may be you could let us know.
>
>
>
> https://github.com/apache/incubator-zeppelin/pull/79
>
>
>
> To build a Zeppelin version with the Datastax Spark/Cassandra connector
> <https://github.com/datastax/spark-cassandra-connector>:
>
> mvn clean package -Pcassandra-spark-1.x -Dhadoop.version=xxx
> -Phadoop-x.x -DskipTests
>
> Right now the Spark/Cassandra connector is available for Spark 1.1 and
> Spark 1.2. Support for Spark 1.3 is not released yet (but you can build
> your own Spark/Cassandra connector version 1.3.0-SNAPSHOT). Support for
> Spark 1.4 does not exist yet.
>
> Please do not forget to add -Dspark.cassandra.connection.host=xxx to the
> ZEPPELIN_JAVA_OPTS parameter in the conf/zeppelin-env.sh file.
> Alternatively you can add this parameter in the parameter list of the
> Spark interpreter on the GUI.
>
>
>
> -Pawan
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Jun 22, 2015 at 9:04 AM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
> Yes, just put the Cassandra connector on the Spark classpath and set the
> connector config properties in the interpreter settings.
>
>
>
> *From: *Mohammed Guller
> *Date: *Monday, June 22, 2015 at 11:56 AM
> *To: *Matthew Johnson, shahid ashraf
>
>
> *Cc: *"user@spark.apache.org"
> *Subject: *RE: Code review - Spark SQL command-line client for Cassandra
>
>
>
> I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for
> sure, but it should not be difficult.
>
>
>
> Mohammed
>
>
>
> *From:* Matthew Johnson [mailto:matt.john...@algomi.com
> ]
> *Sent:* Monday, June 22, 2015 2:15 AM
> *To:* Mohammed Guller; shahid ashraf
> *Cc:* user@spark.apache.org
> *Subject:* RE: Code review - Spark SQL command-line client for Cassandra
>
>
>
> Thanks Mohammed, it’s good to know I’m not alone!
>
>
>
> How easy is it to integrate Zeppelin with Spark on Cassandra? It looks
> like it would only support Hadoop out of the box. Is it just a case of
> dropping the Cassandra Connector onto the Spark classpath?
>
>
>
> Cheers,
>
> Matthew
>
>
>
> *From:* Mohammed Guller [mailto:moham...@glassbeam.com]
> *Sent:* 20 June 2015 17:27
> *To:* shahid ashraf
> *Cc:* Matthew Johnson; user@spark.apache.org
> *Subject:* RE: Code review - Spark SQL command-line client for Cassandra
>
>
>
> It is a simple Play-based web application. It exposes a URI for
> submitting a SQL query. It then executes that query using
> CassandraSQLContext provided by Spark Cassandra Connector. Since it is
> web-based, I added an authentication and authorization layer to make sure
> that only users with the right authorization can use it.
>
>
>
> I am happy to open-source that code if there is interest. Just need to
> carve out some time to clean it up and remove all the other services that
> this web application provides.
>
>
>
> Mohammed
>
>
>
> *From:* shahid ashraf [mailto:sha...@trialx.com ]
> *Sent:* Saturday, June 20, 2015 6:52 AM
> *To:* Mohammed Guller
> *Cc:* Matthew Johnson; user@spark.apache.org
> *Subject:* RE: Code review - Spark SQL command-line client for Cassandra
>
>
>
> Hi Mo

DAG info

2015-01-01 Thread shahid ashraf
hi guys


I have just started using Spark, and I am getting the following INFO messages:
15/01/02 11:54:17 INFO DAGScheduler: Parents of final stage: List()
15/01/02 11:54:17 INFO DAGScheduler: Missing parents: List()
15/01/02 11:54:17 INFO DAGScheduler: Submitting Stage 6 (PythonRDD[12] at
RDD at PythonRDD.scala:43), which has no missing parents

and my program is taking a lot of time to execute.
-- 
with Regards
Shahid Ashraf


Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid ashraf
Hi Costin, I upgraded the es-hadoop connector, and at this point I can't use
Scala, but I am still getting the same error.
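
For context, a commonly suggested workaround for the "Data of type java.util.ArrayList cannot be used" error is to hand es-hadoop pre-serialized JSON instead of nested Python structures. The sketch below is an assumption based on the snippet quoted further down, not a verified fix for this exact setup; 'rdd', the index name, and the connection settings are taken from that snippet.

# Sketch only: serialize each record to JSON and let es-hadoop index it as-is.
import json

json_rdd = rdd.mapValues(json.dumps)   # (id, {'SOURCES': [...]}) -> (id, '{"SOURCES": [...]}')

json_rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": "shahid/hcp_id",
        "es.input.json": "true"        # tells es-hadoop the value is already JSON
    })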

On Tue, Feb 10, 2015 at 10:34 PM, Costin Leau  wrote:

> Hi shahid,
>
> I've sent the reply to the group - for some reason I replied to your
> address instead of the mailing list.
> Let's continue the discussion there.
>
> Cheers,
>
> On 2/10/15 6:58 PM, shahid ashraf wrote:
>
>> Thanks Costin.
>>
>> I am grouping the data together by id in JSON, so the RDD contains pairs like:
>> rdd = [(1, {'SOURCES': [{...n key/value pairs...}]}),
>>        (2, {'SOURCES': [{...n key/value pairs...}]}),
>>        ...,
>>        (n, {'SOURCES': [{...n key/value pairs...}]})]
>> rdd.saveAsNewAPIHadoopFile(
>>     path='-',
>>     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
>>     keyClass="org.apache.hadoop.io.NullWritable",
>>     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
>>     conf={
>>         "es.nodes": "localhost",
>>         "es.port": "9200",
>>         "es.resource": "shahid/hcp_id"
>>     })
>>
>>
>> spark-1.1.0-bin-hadoop1
>> java version "1.7.0_71"
>> elasticsearch-1.4.2
>> elasticsearch-hadoop-2.1.0.Beta2.jar
>>
>>
>> On Tue, Feb 10, 2015 at 10:05 PM, Costin Leau <costin.l...@gmail.com> wrote:
>>
>> Sorry but there's too little information in this email to make any
>> type of assesment.
>> Can you please describe what you are trying to do, what version of
>> Elastic and es-spark are you suing
>> and potentially post a snippet of code?
>> What does your RDD contain?
>>
>>
>> On 2/10/15 6:05 PM, shahid wrote:
>>
>> INFO scheduler.TaskSetManager: Starting task 2.1 in stage 2.0
>> (TID 9,
>> ip-10-80-98-118.ec2.internal, PROCESS_LOCAL, 1025 bytes)
>> 15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in
>> stage 2.0
>> (TID 6) on executor ip-10-80-15-145.ec2.internal:
>> org.apache.spark.SparkException (Data of type
>> java.util.ArrayList cannot be
>> used) [duplicate 1]
>> 15/02/10 15:54:08 INFO scheduler.TaskSetManager: Starting task
>> 1.1 in stage
>> 2.0 (TID 10, ip-10-80-15-145.ec2.internal, PROCESS_LOCAL, 1025
>> bytes)
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Exception-when-trying-to-use-EShadoop-connector-and-writing-rdd-to-ES-tp21579.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>>
>> --
>> Costin
>>
>>
>>
>>
>> --
>> with Regards
>> Shahid Ashraf
>>
> --
> Costin
>



-- 
with Regards
Shahid Ashraf


Re: event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2019-07-17 Thread Shahid K. I.
Hi,
With the current design, event logs are not ideal for long-running streaming
applications, so it is better to disable them for now. There was a proposal to
split the event logs by size/job/query for long-running applications, but I am
not sure about the follow-up on that.
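
For reference, a minimal sketch of disabling the event log for such an application; the app name is an illustrative assumption, and the same property can equally be set via --conf on spark-submit or in spark-defaults.conf.

# Sketch only: turn off event logging for a long-running streaming application
# so the history directory does not accumulate large .inprogress files.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("long-running-streaming-app")   # hypothetical app name
        .set("spark.eventLog.enabled", "false"))
sc = SparkContext(conf=conf)

# Equivalent command-line form (hypothetical script name):
#   spark-submit --conf spark.eventLog.enabled=false my_streaming_app.py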

Regards,
Shahid

On Tue, 16 Jul 2019, 3:38 pm raman gugnani, 
wrote:

> Hi,
>
> I have long-running Spark Streaming jobs.
> The event log directories are getting filled with .inprogress files.
> Is there a fix or workaround for Spark Streaming?
>
> There is also a JIRA raised for the same issue by another reporter:
>
> https://issues.apache.org/jira/browse/SPARK-22783
>
> --
> Raman Gugnani
>
> 8588892293
> Principal Engineer
> *ixigo.com <http://ixigo.com>*
>