Re: event log directory (spark-history) filled by large .inprogress files for spark streaming applications

2019-07-17 Thread Shahid K. I.
, Shahid On Tue, 16 Jul 2019, 3:38 pm raman gugnani, wrote: > Hi, > > I have long-running Spark Streaming jobs. > Event log directories are getting filled with .inprogress files. > Is there a fix or workaround for Spark Streaming? > > There is also one JIRA raised for the
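
For readers hitting the same problem: long-running streaming applications keep a single ever-growing .inprogress event log unless rolling is enabled. A minimal sketch of the relevant settings, assuming Spark 3.0+ where rolling event logs (SPARK-28594, possibly the JIRA alluded to above) are available:

    from pyspark import SparkConf

    # assumption: Spark 3.0+; these settings roll the event log into
    # bounded segments instead of one unbounded .inprogress file
    conf = (SparkConf()
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.rolling.enabled", "true")
            .set("spark.eventLog.rolling.maxFileSize", "128m")
            # history-server side: keep only the newest segments per app
            .set("spark.history.fs.eventLog.rolling.maxFilesToRetain", "5"))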

HDFS

2015-12-11 Thread shahid ashraf
Hi folks, I am using a standalone cluster of 50 servers on AWS. I loaded data onto HDFS; why am I getting Locality Level ANY for data on HDFS? I have 900+ partitions. -- with Regards Shahid Ashraf

Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-11-01 Thread shahid ashraf
<karthik.kadiyam...@gmail.com> > wrote: > >> Hi Shahid, >> >> I played around with the Spark driver memory too. In the conf file it was set >> to " --driver-memory 20G " first. When I changed the spark driver >> maxResultSize from default to 2g, I changed th

Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-10-29 Thread shahid ashraf
Hi, I guess you need to increase the Spark driver memory as well, but that should be set in the conf files. Let me know if that resolves it. On Oct 30, 2015 7:33 AM, "karthik kadiyam" wrote: > Hi, > > In the Spark Streaming job I had the following setting > >
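
A minimal sketch of the two settings under discussion. Note the asymmetry: spark.driver.maxResultSize can be set from the application's SparkConf, while spark.driver.memory must be fixed before the driver JVM starts (spark-defaults.conf or a spark-submit flag), which is presumably why the advice above points at the conf files:

    from pyspark import SparkConf, SparkContext

    # caps the total serialized result size collected to the driver;
    # "0" would mean unlimited
    conf = SparkConf().set("spark.driver.maxResultSize", "2g")
    # assumption: driver memory was already set at launch time, e.g.
    #   spark-submit --driver-memory 20g app.py
    # (setting it here has no effect in client mode)
    sc = SparkContext(conf=conf)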

Python worker exited unexpectedly (crashed)

2015-10-22 Thread shahid
Hi, I am running a 10-node standalone cluster on AWS and loading 100G of data onto HDFS. The first operation is a groupBy, and then I generate pairs from the grouped RDD: (key,[a1,b1]), (key,[a,b,c]) yields pairs like (a1,b1), (a,b), (a,c) ... The PairRDD will get large in size. Some stats from the UI when
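
A sketch of the pair-generation step described above, assuming grouped is the RDD of (key, [values]) records after groupByKey(); the output is quadratic in the group size, which is why the resulting PairRDD blows up on skewed keys:

    from itertools import combinations

    # grouped: RDD of (key, [v1, v2, ...]) pairs (hypothetical name)
    # e.g. (key, [a, b, c]) -> (a, b), (a, c), (b, c)
    pairs = grouped.flatMap(lambda kv: combinations(kv[1], 2))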

Re: How does shuffle work in spark ?

2015-10-19 Thread shahid
@all I did partitionBy using the default hash partitioner on data [(1,data),(2,data),...,(n,data)]. The total data was approx 3.5; it showed a shuffle write of 50G, and on the next action, e.g. count, it is showing a shuffle read of 50G. I don't understand this behaviour, and I think the performance is getting slow with
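
The numbers themselves are expected behaviour: partitionBy is a wide transformation, so every record is written to shuffle files (the 50G shuffle write) and the next stage reads them back (the 50G shuffle read). What makes it feel slow is recomputing that shuffle for each action; a sketch of paying the cost once, with pair_rdd as a hypothetical name:

    # pay the shuffle once, then reuse the partitioned data
    partitioned = pair_rdd.partitionBy(100).cache()
    partitioned.count()  # first action materializes the shuffle and the cache
    partitioned.count()  # later actions read the cached partitions, no shuffle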

Re: repartition vs partitionby

2015-10-18 Thread shahid ashraf
to write a custom partitioner to help > Spark distribute the data more uniformly. > > Sent from my iPhone > > On 17 Oct 2015, at 16:14, shahid ashraf <sha...@trialx.com> wrote: > > Yes, I know about that; it's for the case of reducing partitions. The point here is > the data is

repartition vs partitionby

2015-10-17 Thread shahid qadri
Hi folks, I need to repartition a large set of data, around 300G, as I see some portions have large data (data skew). I have pairRDDs [({},{}),({},{}),({},{})]. What is the best way to solve the problem?
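
One common remedy for this kind of skew is key salting: append a random suffix so a hot key spreads across several partitions, then merge in a second pass. A sketch, assuming the downstream operation is an aggregation and combine is some associative function (both names hypothetical):

    import random

    SALT = 16  # sub-keys per hot key (tuning assumption)

    salted = pair_rdd.map(lambda kv: ((kv[0], random.randrange(SALT)), kv[1]))
    partial = salted.reduceByKey(combine)   # hot key now spans SALT partitions
    result = (partial
              .map(lambda kv: (kv[0][0], kv[1]))  # drop the salt
              .reduceByKey(combine))              # merge the sub-aggregates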

Re: repartition vs partitionby

2015-10-17 Thread shahid ashraf
ns. This one minimizes the data shuffle. > > -Raghav > > On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri <shahidashr...@icloud.com> > wrote: > >> Hi folks >> >> I need to repartition a large set of data, around 300G, as I see some

Build Failure

2015-10-08 Thread shahid qadri
Hi, I tried to build the latest master branch of Spark with build/mvn -DskipTests clean package. Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [03:46 min] [INFO] Spark Project Test Tags SUCCESS [01:02 min] [INFO] Spark Project

API to run spark Jobs

2015-10-06 Thread shahid qadri
Hi folks, how can I submit my Spark app (Python) to the cluster without using spark-submit? Actually, I need to invoke jobs from a UI.

Re: API to run spark Jobs

2015-10-06 Thread shahid qadri
p > distros might, for example EMR in AWS has a job submit UI. > > Spark submit just calls a REST API; you could build any UI you want on top of > that... > > > On Tue, Oct 6, 2015 at 9:37 AM, shahid qadri <shahidashr...@icloud.com > <mailto:shahidashr...@icloud.com>
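
The REST API referred to here is the standalone master's submission server (port 6066, controlled by spark.master.rest.enabled). A hedged sketch of posting to it directly; the host, artifact, class, and version strings are placeholders, and the field names follow the CreateSubmissionRequest schema:

    import requests  # assumption: the requests library is available

    payload = {
        "action": "CreateSubmissionRequest",
        "appResource": "hdfs:///apps/myapp.jar",   # placeholder artifact
        "mainClass": "com.example.MyApp",          # placeholder class
        "appArgs": ["arg1"],
        "clientSparkVersion": "1.5.0",
        "sparkProperties": {"spark.app.name": "myapp",
                            "spark.master": "spark://master:7077"},
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    }
    r = requests.post("http://master:6066/v1/submissions/create", json=payload)
    print(r.json())  # contains the submissionId on success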

INDEXEDRDD in PYSPARK

2015-09-03 Thread shahid ashraf
Hi folks, any resources to get started using https://github.com/amplab/spark-indexedrdd in pyspark? -- with Regards Shahid Ashraf

Re: Custom Partitioner

2015-09-02 Thread shahid ashraf
> > > > class RangePartitioner(Partitioner): > > def __init__(self, numParts): > > self.numPartitions = numParts > > self.partitionFunction = rangePartition > > def rangePartition(key): > > # Logic to turn key into a partition id > > return id >

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread shahid ashraf
led 15/09/02 21:12:43 INFO DAGScheduler: ShuffleMapStage 10 (repartition at NativeMethodAccessorImpl.java:-2) failed in 102.132 s 15/09/02 21:12:43 INFO DAGScheduler: Job 4 failed: collect at /Users/shahid/projects/spark_rl/record_linker_spark.py:74, took 102.154710 s Traceback (most recent call last): Fi

ERROR WHILE REPARTITION

2015-09-02 Thread shahid ashraf
leMapStage 10 (repartition at NativeMethodAccessorImpl.java:-2) failed in 102.132 s 15/09/02 21:12:43 INFO DAGScheduler: Job 4 failed: collect at /Users/shahid/projects/spark_rl/record_linker_spark.py:74, took 102.154710 s Traceback (most recent call last): File "/Users/shahid/projects/spark_rl/record_linker_sp

Custom Partitioner

2015-09-01 Thread shahid qadri
Hi Sparkians, how can we create a custom partitioner in PySpark?

Re: How to efficiently write sorted neighborhood in pyspark

2015-09-01 Thread shahid qadri
> On Aug 25, 2015, at 10:43 PM, shahid qadri <shahidashr...@icloud.com> wrote: > > Any resources on this? > >> On Aug 25, 2015, at 3:15 PM, shahid qadri <shahidashr...@icloud.com> wrote: >> >> I would like to implement the sorted neighborhood approach i

Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
ee below > > class MyPartitioner extends Partitioner { > def numPartitions: Int = // Return the number of partitions > def getPartition(key: Any): Int = // Return the partition for a given key > } > > On Tue, Sep 1, 2015 at 10:15 AM shahid qadri <shahidashr...@icloud.com>

Re: Custom Partitioner

2015-09-01 Thread shahid ashraf
just need > to instantiate the Partitioner class with numPartitions and partitionFunc. > > On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf <sha...@trialx.com> wrote: > >> Hi >> >> I did not get this, e.g. if I need to create a custom partitioner like >> a range partitione
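
To make the advice concrete: PySpark has no Partitioner subclassing as in Scala. partitionBy(numPartitions, partitionFunc) takes a plain function mapping a key to an integer, and Spark applies % numPartitions internally. A sketch of the range-style partitioner attempted above, with hypothetical integer keys and break points:

    # assumption: integer keys and hand-picked range boundaries
    BREAKS = [100, 1000, 10000]

    def range_partition(key):
        # locate the key's range bucket and use it as the partition id
        for i, upper in enumerate(BREAKS):
            if key < upper:
                return i
        return len(BREAKS)

    partitioned = pair_rdd.partitionBy(len(BREAKS) + 1, range_partition)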

Re: How to effieciently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
Any resources on this? On Aug 25, 2015, at 3:15 PM, shahid qadri shahidashr...@icloud.com wrote: I would like to implement the sorted neighborhood approach in Spark; what is the best way to write that in PySpark?

How to efficiently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
I would like to implement the sorted neighborhood approach in Spark; what is the best way to write that in PySpark?
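
A sketch of one way to express sorted neighborhood in PySpark, assuming rdd holds (blocking_key, record) pairs, records are hashable and orderable, and the window size is w: sort by key, index each record, emit it into the w sliding windows it belongs to, then pair up within each window and deduplicate:

    from itertools import combinations

    w = 3  # window size (assumption)

    indexed = rdd.sortByKey().values().zipWithIndex()  # (record, position)
    # a record at position p falls into windows p-w+1 .. p
    windows = indexed.flatMap(
        lambda ri: [(ri[1] - off, ri[0]) for off in range(w)])
    candidate_pairs = (windows.groupByKey()
                       .flatMap(lambda kv: combinations(sorted(kv[1]), 2))
                       .distinct())  # overlapping windows repeat close pairs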

Re: Spark ec2 launch problem

2015-08-21 Thread shahid ashraf
launch spark-cluster but getting the following message endlessly. Please help. Warning: SSH connection error. (This could be temporary.) Host: SSH return code: 255 SSH output: ssh: Could not resolve hostname : Name or service not known -- with Regards Shahid Ashraf

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread shahid ashraf
? Or is there something I am doing wrong? Thank you in advance for any pointers you can provide. -sujit -- with Regards Shahid Ashraf

Re: No. of Task vs No. of Executors

2015-07-21 Thread shahid ashraf
, then it's a node issue; else, most likely a data issue. On Tue, Jul 14, 2015 at 11:43 PM, shahid sha...@trialx.com wrote: Hi, I have a 10-node cluster. I loaded the data onto HDFS, so the no. of partitions I get is 9. I am running a Spark application; it gets stuck on one of the tasks. Looking
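
A quick way to tell the two cases apart is to measure the partitions directly: glom() turns each partition into a list, so mapping len over it shows whether one partition dwarfs the rest (data skew) or sizes are even (pointing at the node). A small diagnostic sketch:

    # per-partition record counts; one huge value indicates key skew
    sizes = rdd.glom().map(len).collect()
    print(sorted(sizes, reverse=True)[:10])  # the ten largest partitions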

Re: Research ideas using spark

2015-07-15 Thread shahid ashraf
- there's a lot of work being done on this. Best, Will On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com wrote: Hi Daniel Well said Regards Vineel On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Hi Shahid, To be honest I

No. of Task vs No. of Executors

2015-07-14 Thread shahid
Hi, I have a 10-node cluster. I loaded the data onto HDFS, so the no. of partitions I get is 9. I am running a Spark application; it gets stuck on one of the tasks. Looking at the UI, it seems the application is not using all nodes to do calculations. Attached is the screenshot of tasks; it seems tasks

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-20 Thread shahid ashraf
Hi Mohammed, can you provide more info about the service you developed? On Jun 20, 2015 7:59 AM, Mohammed Guller moham...@glassbeam.com wrote: Hi Matthew, It looks fine to me. I have built a similar service that allows a user to submit a query from a browser and returns the result in JSON

getting this error while running

2015-02-28 Thread shahid
conf = SparkConf().setAppName("spark_calc3merged").setMaster("spark://ec2-54-145-68-13.compute-1.amazonaws.com:7077") sc = SparkContext(conf=conf, pyFiles=["/root/platinum.py", "/root/collections2.py"]) 15/02/28 19:06:38 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3.0 (TID 38,

Re: getting this error while running

2015-02-28 Thread shahid
Also, the data file is on HDFS. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/getting-this-error-while-runing-tp21860p21861.html

Re: Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid ashraf
Hi Costin, I upgraded the ES Hadoop connector, and at this point I can't use Scala, but I am still getting the same error. On Tue, Feb 10, 2015 at 10:34 PM, Costin Leau costin.l...@gmail.com wrote: Hi shahid, I've sent the reply to the group - for some reason I replied to your address instead

Exception when trying to use EShadoop connector and writing rdd to ES

2015-02-10 Thread shahid
INFO scheduler.TaskSetManager: Starting task 2.1 in stage 2.0 (TID 9, ip-10-80-98-118.ec2.internal, PROCESS_LOCAL, 1025 bytes) 15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 6) on executor ip-10-80-15-145.ec2.internal: org.apache.spark.SparkException (Data of type
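
For context, a sketch of the usual elasticsearch-hadoop write path from PySpark that this thread exercises; the node address and index/type are placeholders, and documents are passed as (id, dict) pairs:

    # assumption: the elasticsearch-hadoop jar is on the executor classpath
    es_conf = {
        "es.nodes": "localhost",          # placeholder ES node
        "es.resource": "myindex/mytype",  # placeholder index/type
    }
    docs = sc.parallelize([("1", {"name": "a"}), ("2", {"name": "b"})])
    docs.saveAsNewAPIHadoopFile(
        path="-",  # ignored by EsOutputFormat
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)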

problem while running code

2015-01-05 Thread shahid
The log is here: py4j.protocol.Py4JError: An error occurred while calling o22.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at
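
This particular Py4J error usually means PySpark tried to pickle a JVM-backed object, most often because the SparkContext (or an RDD) was captured inside a closure; whether that is the cause here is an assumption. A minimal sketch of the anti-pattern and its fix:

    rdd = sc.parallelize(range(10))

    # anti-pattern: referencing sc inside a transformation makes Spark
    # pickle the SparkContext and raises Py4JError (__getnewargs__):
    # bad = rdd.map(lambda x: sc.parallelize([x]))

    # fix: create driver-side handles first, reference only plain data
    bc = sc.broadcast(100)
    good = rdd.map(lambda x: x + bc.value)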

DAG info

2015-01-01 Thread shahid
Hi guys, I have just started using Spark. I am getting this as INFO output: 15/01/02 11:54:17 INFO DAGScheduler: Parents of final stage: List() 15/01/02 11:54:17 INFO DAGScheduler: Missing parents: List() 15/01/02 11:54:17 INFO DAGScheduler: Submitting Stage 6 (PythonRDD[12] at RDD at