Re: What is the best way for Spark to read HDF5@scale?

2018-09-17 Thread Saurav Sinha
Here is the solution

sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")

How did I find out nn1home:8020?

Just search for the file core-site.xml and look for the XML element fs.defaultFS.
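You can also read it straight from the Hadoop configuration that Spark has
already loaded — a quick sketch from the Scala shell (the value shown is just
the example above):

sc.hadoopConfiguration.get("fs.defaultFS")
// e.g. res0: String = hdfs://nn1home:8020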

On Fri, Sep 14, 2018 at 7:57 PM kathleen li  wrote:

> Hi,
> Any Spark-connector for HDF5?
>
> The following link does not work anymore?
>
> https://www.hdfgroup.org/downloads/spark-connector/
>
> Thanks,
>
> Kathleen
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Setting spark.yarn.stagingDir in 1.6

2017-03-15 Thread Saurav Sinha
Hi Users,


I am running a Spark job on YARN.

I want to set the staging directory to some location other than the default,
which is hdfs://host:port/home/$User/

In Spark 2.0.0 this can be done by setting spark.yarn.stagingDir.
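For example (a sketch only; the HDFS path is a placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.stagingDir", "hdfs://host:port/tmp/spark-staging") // placeholder path

or, equivalently, --conf spark.yarn.stagingDir=... on spark-submit.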

But in production we have Spark 1.6. Can anyone please suggest how it can
be done in Spark 1.6?

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Can anyone help me out?

On Mon, Oct 17, 2016 at 7:27 PM, Saurav Sinha <sauravsinh...@gmail.com>
wrote:

> Hi,
>
> I am in a situation where I want to generate a unique Id for each row.
>
> I have used monotonicallyIncreasingId, but it only gives increasing values,
> and it starts generating from the beginning again if the job fails.
>
> I have two questions here:
>
> Q1. Does this method give me a unique id even in a failure situation? I
> want to use that ID as my Solr id.
>
> Q2. If the answer to the previous question is no, is there a way to generate
> a UUID for each row which is unique and not updatable?
>
> I have run into a situation where the UUID gets updated:
>
>
> val idUDF = udf(() => UUID.randomUUID().toString)
> val a = rawDataDf.withColumn("alarmUUID", idUDF()) // lit() is not needed around a UDF call
> a.persist(StorageLevel.MEMORY_AND_DISK)
> rawDataDf.registerTempTable("rawAlarms")
>
> ///
> /// I do some joins
>
> but further below, when I do something like the following (b is a
> transformation of a):
>
> sqlContext.sql("""select a.alarmUUID, b.alarmUUID
>                   from a right outer join b on a.alarmUUID = b.alarmUUID""")
>
> it gives output like this:
>
> +--------------------+--------------------+
> |           alarmUUID|           alarmUUID|
> +--------------------+--------------------+
> |7d33a516-5532-410...|                null|
> |                null|2439d6db-16a2-44b...|
> +--------------------+--------------------+
>
>
>
> --
> Thanks and Regards,
>
> Saurav Sinha
>
> Contact: 9742879062
>



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Hi,

I am in a situation where I want to generate a unique Id for each row.

I have used monotonicallyIncreasingId, but it only gives increasing values, and
it starts generating from the beginning again if the job fails.

I have two questions here:

Q1. Does this method give me a unique id even in a failure situation? I
want to use that ID as my Solr id.

Q2. If the answer to the previous question is no, is there a way to generate a
UUID for each row which is unique and not updatable?

I have run into a situation where the UUID gets updated:


val idUDF = udf(() => UUID.randomUUID().toString)
val a = rawDataDf.withColumn("alarmUUID", idUDF()) // lit() is not needed around a UDF call
a.persist(StorageLevel.MEMORY_AND_DISK)
rawDataDf.registerTempTable("rawAlarms")

///
/// I do some joins

but further below, when I do something like the following (b is a
transformation of a):

sqlContext.sql("""select a.alarmUUID, b.alarmUUID
                  from a right outer join b on a.alarmUUID = b.alarmUUID""")

it gives output like this:

+--------------------+--------------------+
|           alarmUUID|           alarmUUID|
+--------------------+--------------------+
|7d33a516-5532-410...|                null|
|                null|2439d6db-16a2-44b...|
+--------------------+--------------------+
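One possible workaround (a sketch only — it assumes the pipeline can afford an
extra write, and the path below is a placeholder) is to materialise the
generated IDs once, e.g. by writing the DataFrame out and reading it back, so
the non-deterministic UUID UDF is not re-evaluated when the plan is recomputed
for the join:

val withIds = rawDataDf.withColumn("alarmUUID", idUDF())
withIds.write.parquet("hdfs:///tmp/alarms_with_ids")            // placeholder path
val pinned = sqlContext.read.parquet("hdfs:///tmp/alarms_with_ids")
pinned.registerTempTable("rawAlarms")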



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Finding unique across all columns in dataset

2016-09-19 Thread Saurav Sinha
You can use distinct over your DataFrame or RDD:

rdd.distinct

It will give you the distinct rows.
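For example, a minimal sketch (the output path is just a placeholder):

rdd.distinct().saveAsTextFile("hdfs://namenode:8020/output/distinct-rows") // placeholder path

If you are on DataFrames, df.distinct() does the equivalent at the row level.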

On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand <abhis.anan...@gmail.com>
wrote:

> I have an rdd which contains 14 different columns. I need to find the
> distinct across all the columns of rdd and write it to hdfs.
>
> How can I achieve this?
>
> Is there any distributed data structure that I can use and keep on
> updating it as I traverse the new rows ?
>
> Regards,
> Abhi
>



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Explanation regarding Spark Streaming

2016-08-04 Thread Saurav Sinha
Hi,

I have a query:

Q1. What will happen if a Spark Streaming job has a batch duration of 60 sec
and the processing time of the complete pipeline is greater than 60 sec?

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Spark driver getting out of memory

2016-07-20 Thread Saurav Sinha
Hi,

I have set driver memory to 10 GB and the job ran with intermediate failures,
which Spark recovered from.

But I still want to know: as the number of partitions increases, does driver
RAM need to be increased, and what is the ratio of number of partitions to RAM?

@RK: I am using cache on the RDD. Is this the reason for the high RAM utilization?
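For reference, a minimal sketch of one way to cap the number of partitions
before the final write, so the driver has fewer tasks to track (the RDD name,
helper, and coalesce factor are placeholders, not from the actual job):

// solrDocsRdd and writeToKafkaPartition are hypothetical names
solrDocsRdd
  .coalesce(2000)                                   // placeholder; tune to the cluster
  .foreachPartition(iter => writeToKafkaPartition(iter))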

Thanks,
Saurav Sinha

On Tue, Jul 19, 2016 at 10:14 PM, RK Aduri <rkad...@collectivei.com> wrote:

> Just want to see if this helps.
>
> Are you doing heavy collects and persist that? If that is so, you might
> want to parallelize that collection by converting to an RDD.
>
> Thanks,
> RK
>
> On Tue, Jul 19, 2016 at 12:09 AM, Saurav Sinha <sauravsinh...@gmail.com>
> wrote:
>
>> Hi Mich,
>>
>>1. In what mode are you running the spark standalone, yarn-client,
>>yarn cluster etc
>>
>> Ans: spark standalone
>>
>>1. You have 4 nodes with each executor having 10G. How many actual
>>executors do you see in UI (Port 4040 by default)
>>
>> Ans: There are 4 executors, on which I am using 8 cores each
>> (--total-executor-cores 32)
>>
>>1. What is master memory? Are you referring to diver memory? May be I
>>am misunderstanding this
>>
>> Ans: Driver memory is set with --driver-memory 5g
>>
>>1. The only real correlation I see with the driver memory is when you
>>are running in local mode where worker lives within JVM process that you
>>start with spark-shell etc. In that case driver memory matters. However, 
>> it
>>appears that you are running in another mode with 4 nodes?
>>
>> Ans: I am running my job via spark-submit, and on my worker (executor) nodes
>> there is no OOM issue; it is only happening in the driver app.
>>
>> Thanks,
>> Saurav Sinha
>>
>> On Tue, Jul 19, 2016 at 2:42 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> can you please clarify:
>>>
>>>
>>>1. In what mode are you running the spark standalone, yarn-client,
>>>yarn cluster etc
>>>2. You have 4 nodes with each executor having 10G. How many actual
>>>executors do you see in UI (Port 4040 by default)
>>>3. What is master memory? Are you referring to diver memory? May be
>>>I am misunderstanding this
>>>4. The only real correlation I see with the driver memory is when
>>>you are running in local mode where worker lives within JVM process that
>>>you start with spark-shell etc. In that case driver memory matters.
>>>However, it appears that you are running in another mode with 4 nodes?
>>>
>>> Can you get a snapshot of your environment tab in UI and send the output
>>> please?
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 18 July 2016 at 11:50, Saurav Sinha <sauravsinh...@gmail.com> wrote:
>>>
>>>> I have set --driver-memory 5g. I need to understand whether, as the number
>>>> of partitions increases, driver memory needs to be increased. What would be
>>>> the best ratio of number of partitions to driver memory.
>>>>
>>>> On Mon, Jul 18, 2016 at 4:07 PM, Zhiliang Zhu <zchl.j...@yahoo.com>
>>>> wrote:
>>>>
>>>>> try to set --driver-memory xg, x would be as large as can be set.
>>>>>
>>>>>
>>>>> On Monday, July 18, 2016 6:31 PM, Saurav Sinha <
>>>>> sauravsinh...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am running spark job.
>>>>>
>>>>> Master memory - 5G
>>>>> executor memort 10G(running on 4 node)
>>>>>
>>>>> My job is getting killed as no of partition increase to 20K.
>>>>>

Re: Spark driver getting out of memory

2016-07-19 Thread Saurav Sinha
Hi Mich,

   1. In what mode are you running the spark standalone, yarn-client, yarn
   cluster etc

Ans: spark standalone

   2. You have 4 nodes with each executor having 10G. How many actual
   executors do you see in UI (Port 4040 by default)

Ans: There are 4 executors, on which I am using 8 cores each
(--total-executor-cores 32)

   3. What is master memory? Are you referring to driver memory? Maybe I am
   misunderstanding this

Ans: Driver memory is set with --driver-memory 5g

   4. The only real correlation I see with the driver memory is when you
   are running in local mode where worker lives within JVM process that you
   start with spark-shell etc. In that case driver memory matters. However, it
   appears that you are running in another mode with 4 nodes?

Ans: I am running my job via spark-submit, and on my worker (executor) nodes
there is no OOM issue; it is only happening in the driver app.

Thanks,
Saurav Sinha

On Tue, Jul 19, 2016 at 2:42 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> can you please clarify:
>
>
>1. In what mode are you running the spark standalone, yarn-client,
>yarn cluster etc
>2. You have 4 nodes with each executor having 10G. How many actual
>executors do you see in UI (Port 4040 by default)
>3. What is master memory? Are you referring to diver memory? May be I
>am misunderstanding this
>4. The only real correlation I see with the driver memory is when you
>are running in local mode where worker lives within JVM process that you
>start with spark-shell etc. In that case driver memory matters. However, it
>appears that you are running in another mode with 4 nodes?
>
> Can you get a snapshot of your environment tab in UI and send the output
> please?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 18 July 2016 at 11:50, Saurav Sinha <sauravsinh...@gmail.com> wrote:
>
>> I have set --driver-memory 5g. I need to understand whether, as the number of
>> partitions increases, driver memory needs to be increased. What would be the
>> best ratio of number of partitions to driver memory.
>>
>> On Mon, Jul 18, 2016 at 4:07 PM, Zhiliang Zhu <zchl.j...@yahoo.com>
>> wrote:
>>
>>> try to set --driver-memory xg, x would be as large as can be set.
>>>
>>>
>>> On Monday, July 18, 2016 6:31 PM, Saurav Sinha <sauravsinh...@gmail.com>
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>> I am running spark job.
>>>
>>> Master memory - 5G
>>> executor memory 10G (running on 4 nodes)
>>>
>>> My job is getting killed as the number of partitions increases to 20K.
>>>
>>> 16/07/18 14:53:13 INFO DAGScheduler: Got job 17 (foreachPartition at
>>> WriteToKafka.java:45) with 13524 output partitions (allowLocal=false)
>>> 16/07/18 14:53:13 INFO DAGScheduler: Final stage: ResultStage
>>> 640(foreachPartition at WriteToKafka.java:45)
>>> 16/07/18 14:53:13 INFO DAGScheduler: Parents of final stage:
>>> List(ShuffleMapStage 518, ShuffleMapStage 639)
>>> 16/07/18 14:53:23 INFO DAGScheduler: Missing parents: List()
>>> 16/07/18 14:53:23 INFO DAGScheduler: Submitting ResultStage 640
>>> (MapPartitionsRDD[271] at map at BuildSolrDocs.java:209), which has no
>>> missing
>>> parents
>>> 16/07/18 14:53:23 INFO MemoryStore: ensureFreeSpace(8248) called with
>>> curMem=41923262, maxMem=2778778828
>>> 16/07/18 14:53:23 INFO MemoryStore: Block broadcast_90 stored as values
>>> in memory (estimated size 8.1 KB, free 2.5 GB)
>>> Exception in thread "dag-scheduler-event-loop"
>>> java.lang.OutOfMemoryError: Java heap space
>>> at
>>> org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
>>> at
>>> org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
>>>     at
>>> org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
>>> at
>>> org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
>>> at
>>> org.apache.spark.io.SnappyOutputStreamWrapper.flush(CompressionCodec.scala:197)
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
>>>
>>>
>>> Help needed.
>>>
>>> --
>>> Thanks and Regards,
>>>
>>> Saurav Sinha
>>>
>>> Contact: 9742879062
>>>
>>>
>>>
>>
>>
>> --
>> Thanks and Regards,
>>
>> Saurav Sinha
>>
>> Contact: 9742879062
>>
>
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Spark driver getting out of memory

2016-07-18 Thread Saurav Sinha
I have set --driver-memory 5g. I need to understand whether, as the number of
partitions increases, driver memory needs to be increased. What would be the
best ratio of number of partitions to driver memory?

On Mon, Jul 18, 2016 at 4:07 PM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

> try to set --driver-memory xg, x would be as large as can be set.
>
>
> On Monday, July 18, 2016 6:31 PM, Saurav Sinha <sauravsinh...@gmail.com>
> wrote:
>
>
> Hi,
>
> I am running spark job.
>
> Master memory - 5G
> executor memory 10G (running on 4 nodes)
>
> My job is getting killed as the number of partitions increases to 20K.
>
> 16/07/18 14:53:13 INFO DAGScheduler: Got job 17 (foreachPartition at
> WriteToKafka.java:45) with 13524 output partitions (allowLocal=false)
> 16/07/18 14:53:13 INFO DAGScheduler: Final stage: ResultStage
> 640(foreachPartition at WriteToKafka.java:45)
> 16/07/18 14:53:13 INFO DAGScheduler: Parents of final stage:
> List(ShuffleMapStage 518, ShuffleMapStage 639)
> 16/07/18 14:53:23 INFO DAGScheduler: Missing parents: List()
> 16/07/18 14:53:23 INFO DAGScheduler: Submitting ResultStage 640
> (MapPartitionsRDD[271] at map at BuildSolrDocs.java:209), which has no
> missing
> parents
> 16/07/18 14:53:23 INFO MemoryStore: ensureFreeSpace(8248) called with
> curMem=41923262, maxMem=2778778828
> 16/07/18 14:53:23 INFO MemoryStore: Block broadcast_90 stored as values in
> memory (estimated size 8.1 KB, free 2.5 GB)
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError:
> Java heap space
> at
> org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
> at
> org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
> at
> org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
> at
> org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
> at
> org.apache.spark.io.SnappyOutputStreamWrapper.flush(CompressionCodec.scala:197)
>     at
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
>
>
> Help needed.
>
> --
> Thanks and Regards,
>
> Saurav Sinha
>
> Contact: 9742879062
>
>
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Spark driver getting out of memory

2016-07-18 Thread Saurav Sinha
Hi,

I am running a Spark job.

Master memory: 5G
Executor memory: 10G (running on 4 nodes)

My job is getting killed as the number of partitions increases to 20K.

16/07/18 14:53:13 INFO DAGScheduler: Got job 17 (foreachPartition at
WriteToKafka.java:45) with 13524 output partitions (allowLocal=false)
16/07/18 14:53:13 INFO DAGScheduler: Final stage: ResultStage
640(foreachPartition at WriteToKafka.java:45)
16/07/18 14:53:13 INFO DAGScheduler: Parents of final stage:
List(ShuffleMapStage 518, ShuffleMapStage 639)
16/07/18 14:53:23 INFO DAGScheduler: Missing parents: List()
16/07/18 14:53:23 INFO DAGScheduler: Submitting ResultStage 640
(MapPartitionsRDD[271] at map at BuildSolrDocs.java:209), which has no
missing
parents
16/07/18 14:53:23 INFO MemoryStore: ensureFreeSpace(8248) called with
curMem=41923262, maxMem=2778778828
16/07/18 14:53:23 INFO MemoryStore: Block broadcast_90 stored as values in
memory (estimated size 8.1 KB, free 2.5 GB)
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError:
Java heap space
at
org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
at
org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
at
org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at
org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:273)
at
org.apache.spark.io.SnappyOutputStreamWrapper.flush(CompressionCodec.scala:197)
at
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)


Help needed.

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Error in Spark job

2016-07-12 Thread Saurav Sinha
)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: EOFException while reading from HDFS

2016-04-28 Thread Saurav Sinha
//wiki.apache.org/hadoop/ConnectionRefused
>> >
>> >
>> > To me it seemed like this may result from a version mismatch between
>> Spark
>> > Hadoop client and my Hadoop cluster, so I have made the following
>> changes:
>> >
>> >
>> > 1) Added the following lines to conf/spark-env.sh
>> >
>> >
>> > export HADOOP_HOME="/usr/local/hadoop-1.0.4" export
>> > HADOOP_CONF_DIR="$HADOOP_HOME/conf" export
>> > HDFS_URL="hdfs://172.26.49.156:8020"
>> >
>> >
>> > 2) Downloaded Spark 1.6.0, pre-built with user-provided Hadoop, and in
>> > addition to the three lines above, added the following line to
>> > conf/spark-env.sh
>> >
>> >
>> > export SPARK_DIST_CLASSPATH="/usr/local/hadoop-1.0.4/bin/hadoop"
>> >
>> >
>> > but none of it seems to work. However, the following command works from
>> > 172.26.49.55 and gives the directory listing:
>> >
>> > /usr/local/hadoop-1.0.4/bin/hadoop fs -ls hdfs://172.26.49.156:54310/
>> >
>> >
>> > Any suggestion?
>> >
>> >
>> > Thanks
>> >
>> > Bibudh
>> >
>> >
>> > --
>> > Bibudh Lahiri
>> > Data Scientist, Impetus Technologies
>> > 5300 Stevens Creek Blvd
>> > San Jose, CA 95129
>> > http://knowthynumbers.blogspot.com/
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Bibudh Lahiri
> Senior Data Scientist, Impetus Technologies
> 720 University Avenue, Suite 130
> Los Gatos, CA 95129
> http://knowthynumbers.blogspot.com/
>
>
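One thing worth double-checking in the setup quoted above (a hedged note based
on the Spark documentation for builds that use a user-provided Hadoop, not on
this particular cluster): SPARK_DIST_CLASSPATH is normally set to the output of
`hadoop classpath`, not to the path of the hadoop binary itself, e.g.

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-1.0.4/bin/hadoop classpath)

Also note that the export above uses port 8020, while the hadoop fs -ls command
that does work uses 54310.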



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Experts,

I am facing an issue in which a Spark job runs infinitely.

I start the Spark job on a 4-node cluster.

When there is no space left on one machine, the job keeps running indefinitely.

Has anyone come across such an issue? Is there any way to kill the job when
such a thing happens?



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Ted,

*Do you have monitoring put in place to detect 'no space left' scenario ?*

No, I don't have any monitoring in place.

*By 'way to kill job', do you mean automatic kill ?*

Yes, I need some way by which my job will detect this failure and kill
itself.

Thanks,
Saurav

On Mon, Oct 12, 2015 at 10:46 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Do you have monitoring put in place to detect 'no space left' scenario ?
>
> By 'way to kill job', do you mean automatic kill ?
>
> Please include the release of Spark, command line for 'spark-submit' in
> your reply.
>
> Thanks
>
> On Mon, Oct 12, 2015 at 10:07 AM, Saurav Sinha <sauravsinh...@gmail.com>
> wrote:
>
>> Hi Experts,
>>
>> I am facing issue in which spark job is running infinitely.
>>
>> When I start spark job on 4 node cluster.
>>
>> In which there is no space left on one machine then it is running
>> infinity.
>>
>> Does any one can across such an issue. Is any why to kill job when such
>> thing happens.
>>
>>
>>
>> --
>> Thanks and Regards,
>>
>> Saurav Sinha
>>
>> Contact: 9742879062
>>
>
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Ted,

Which monitoring service would you suggest for me?

Thanks,
Saurav

On Mon, Oct 12, 2015 at 11:55 PM, Saurav Sinha <sauravsinh...@gmail.com>
wrote:

> Hi Ted,
>
> Which monitoring service would you suggest for me?
>
> Thanks,
> Saurav
>
> On Mon, Oct 12, 2015 at 11:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I would suggest you install monitoring service.
>> 'no space left' condition would affect other services, not just Spark.
>>
>> For the second part, Spark experts may have answer for you.
>>
>> On Mon, Oct 12, 2015 at 11:09 AM, Saurav Sinha <sauravsinh...@gmail.com>
>> wrote:
>>
>>> Hi Ted,
>>>
>>> *Do you have monitoring put in place to detect 'no space left' scenario
>>> ?*
>>>
>>> No, I don't have any monitoring in place.
>>>
>>> *By 'way to kill job', do you mean automatic kill ?*
>>>
>>> Yes, I need some way by which my job will detect this failure and kill
>>> itself.
>>>
>>> Thanks,
>>> Saurav
>>>
>>> On Mon, Oct 12, 2015 at 10:46 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Do you have monitoring put in place to detect 'no space left' scenario
>>>> ?
>>>>
>>>> By 'way to kill job', do you mean automatic kill ?
>>>>
>>>> Please include the release of Spark, command line for 'spark-submit' in
>>>> your reply.
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Oct 12, 2015 at 10:07 AM, Saurav Sinha <sauravsinh...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Experts,
>>>>>
>>>>> I am facing issue in which spark job is running infinitely.
>>>>>
>>>>> When I start spark job on 4 node cluster.
>>>>>
>>>>> In which there is no space left on one machine then it is running
>>>>> infinity.
>>>>>
>>>>> Does any one can across such an issue. Is any why to kill job when
>>>>> such thing happens.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks and Regards,
>>>>>
>>>>> Saurav Sinha
>>>>>
>>>>> Contact: 9742879062
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>>
>>> Saurav Sinha
>>>
>>> Contact: 9742879062
>>>
>>
>>
>
>
> --
> Thanks and Regards,
>
> Saurav Sinha
>
> Contact: 9742879062
>



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Ted,

Which monitoring service would you suggest for me?

Thanks,
Saurav

On Mon, Oct 12, 2015 at 11:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I would suggest you install monitoring service.
> 'no space left' condition would affect other services, not just Spark.
>
> For the second part, Spark experts may have answer for you.
>
> On Mon, Oct 12, 2015 at 11:09 AM, Saurav Sinha <sauravsinh...@gmail.com>
> wrote:
>
>> Hi Ted,
>>
>> *Do you have monitoring put in place to detect 'no space left' scenario ?*
>>
>> No, I don't have any monitoring in place.
>>
>> *By 'way to kill job', do you mean automatic kill ?*
>>
>> Yes, I need some way by which my job will detect this failure and kill
>> itself.
>>
>> Thanks,
>> Saurav
>>
>> On Mon, Oct 12, 2015 at 10:46 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Do you have monitoring put in place to detect 'no space left' scenario ?
>>>
>>> By 'way to kill job', do you mean automatic kill ?
>>>
>>> Please include the release of Spark, command line for 'spark-submit' in
>>> your reply.
>>>
>>> Thanks
>>>
>>> On Mon, Oct 12, 2015 at 10:07 AM, Saurav Sinha <sauravsinh...@gmail.com>
>>> wrote:
>>>
>>>> Hi Experts,
>>>>
>>>> I am facing issue in which spark job is running infinitely.
>>>>
>>>> When I start spark job on 4 node cluster.
>>>>
>>>> In which there is no space left on one machine then it is running
>>>> infinity.
>>>>
>>>> Does any one can across such an issue. Is any why to kill job when such
>>>> thing happens.
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks and Regards,
>>>>
>>>> Saurav Sinha
>>>>
>>>> Contact: 9742879062
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks and Regards,
>>
>> Saurav Sinha
>>
>> Contact: 9742879062
>>
>
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Master getting down with Memory issue.

2015-09-28 Thread Saurav Sinha
Hi Spark Users,

I am running some Spark jobs every hour. After running for 12 hours, the
master is getting killed with the exception:

*java.lang.OutOfMemoryError: GC overhead limit exceeded*

It looks like there is some memory issue in the Spark master.
The Spark master being down is a blocker for us; can anyone please suggest anything?


I noticed the same kind of issue with the Spark history server.

In my job I have to monitor whether the job completed successfully; for that I
am hitting curl to get the status, but once the number of jobs increased to
more than 80 apps the history server started responding with a delay, taking
more than 5 min to return the status of the jobs.

Running Spark 1.4.1 in standalone mode on a 5-machine cluster.

Kindly suggest a solution for the memory issue; it is a blocker.
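For reference, a sketch of the settings that control the standalone master /
history-server daemon heap and how many completed applications they keep in
memory (the values below are illustrative placeholders):

# spark-env.sh — heap for the standalone master/worker/history-server daemons
export SPARK_DAEMON_MEMORY=4g

# spark-defaults.conf — cap retained completed applications
spark.deploy.retainedApplications   50
spark.history.retainedApplications  50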

Thanks,
Saurav Sinha

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Master getting down with Memory issue.

2015-09-28 Thread Saurav Sinha
Hi Akhil,

Can you please explain to me how increasing the number of partitions (which is
a worker-node concern) will help?

The issue is that my master is getting an OOM.

Thanks,
Saurav Sinha

On Mon, Sep 28, 2015 at 2:32 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> This behavior totally depends on the job that you are doing. Usually
> increasing the # of partitions will sort out this issue. It would be good
> if you can paste the code snippet or explain what type of operations that
> you are doing.
>
> Thanks
> Best Regards
>
> On Mon, Sep 28, 2015 at 11:37 AM, Saurav Sinha <sauravsinh...@gmail.com>
> wrote:
>
>> Hi Spark Users,
>>
>> I am running some spark jobs which is running every hour.After running
>> for 12 hours master is getting killed giving exception as
>>
>> *java.lang.OutOfMemoryError: GC overhead limit exceeded*
>>
>> It look like there is some memory issue in spark master.
>> Spark Master is blocker. Any one please suggest me any thing.
>>
>>
>> Same kind of issue I noticed with spark history server.
>>
>> In my job I have to monitor if job completed successfully, for that I am
>> hitting curl to get status but when no of jobs has increased to >80 apps
>> history server start responding with delay.Like it is taking more then 5
>> min to respond status of jobs.
>>
>> Running spark 1.4.1 in standalone mode on 5 machine cluster.
>>
>> Kindly suggest me solution for memory issue it is blocker.
>>
>> Thanks,
>> Saurav Sinha
>>
>> --
>> Thanks and Regards,
>>
>> Saurav Sinha
>>
>> Contact: 9742879062
>>
>
>


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: Unreachable dead objects permanently retained on heap

2015-09-25 Thread Saurav Sinha
Hi Spark Users,

I am running some Spark jobs every hour. After running for 12 hours, the
master is getting killed with the exception:

*java.lang.OutOfMemoryError: GC overhead limit exceeded*

It looks like there is some memory issue in the Spark master.

I noticed the same kind of issue with the Spark history server.

In my job I have to monitor whether the job completed successfully; for that I
am hitting curl to get the status, but once the number of jobs increased to
more than 80 apps the history server started responding with a delay, taking
more than 5 min to return the status of the jobs.

Running Spark 1.4.1 in standalone mode on a 5-machine cluster.

Kindly suggest a solution for the memory issue; it is a blocker.

Thanks,
Saurav Sinha

On Fri, Sep 25, 2015 at 5:01 PM, James Aley <james.a...@swiftkey.com> wrote:

> Hi,
>
> We have an application that submits several thousands jobs within the same
> SparkContext, using a thread pool to run about 50 in parallel. We're
> running on YARN using Spark 1.4.1 and seeing a problem where our driver is
> killed by YARN due to running beyond physical memory limits (no Java OOM
> stack trace though).
>
> Plugging in YourKit, I can see that in fact the application is running low
> on heap. The suspicious thing we're seeing is that the old generation is
> filling up with dead objects, which don't seem to be fully removed during
> the stop-the-world sweeps we see happening later in the running of the
> application.
>
> With allocation tracking enabled, I can see that maybe 80%+ of that dead
> heap space consists of byte arrays, which appear to contain some
> snappy-compressed Hadoop configuration data. Many of them are 4MB each,
> other hundreds of KBs. The allocation tracking reveals that they were
> originally allocated in calls to sparkContext.hadoopFile() (from
> AvroRelation in spark-avro). It seems that this data was broadcast to the
> executors as a result of that call? I'm not clear on the implementation
> details, but I can imagine that might be necessary?
>
> This application is essentially a batch job to take many Avro files and
> merging them into larger Parquet files. What it does is builds a DataFrame
> of Avro files, then for each DataFrame, starts a job using
> .coalesce(N).write().parquet() on a fixed size thread pool.
>
> It seems that for each of those calls, another chunk of heap space
> disappears to one of these byte arrays and is never reclaimed. I understand
> that broadcast variables remain in memory on the driver application in
> their serialized form, and that at least appears to be consistent with what
> I'm seeing here. Question is, what can we do about this? Is there a way to
> reclaim this memory? Should those arrays be GC'ed when jobs finish?
>
> Any guidance greatly appreciated.
>
>
> Many thanks,
>
> James.
>
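For readers skimming the thread, a minimal sketch of the Avro-to-Parquet merge
pattern James describes above (the paths and the coalesce factor are
placeholders, not taken from his job):

val avroDf = sqlContext.read.format("com.databricks.spark.avro")
  .load("hdfs:///input/some-avro-dir")                       // placeholder input
avroDf.coalesce(16).write.parquet("hdfs:///output/merged")   // placeholder output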



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Fwd: Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
Hi Users,

I am new to Spark and have written a flow. When we deployed our code it was
completing jobs in 4-5 minutes, but now it is taking 20+ minutes with almost
the same set of data. Can you please help me figure out the reason for it?

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
Hi Users,

I am new to Spark and have written a flow. When we deployed our code it was
completing jobs in 4-5 minutes, but now it is taking 20+ minutes with almost
the same set of data. Can you please help me figure out the reason for it?

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Fwd: Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
-- Forwarded message --
From: "Saurav Sinha" <sauravsinh...@gmail.com>
Date: 21-Sep-2015 11:48 am
Subject: Issue with high no of skipped task
To: <user@spark.apache.org>
Cc:


Hi Users,

I am new to Spark and have written a flow. When we deployed our code it was
completing jobs in 4-5 minutes, but now it is taking 20+ minutes with almost
the same set of data. Can you please help me figure out the reason for it?

-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062