Re: Unsubscribe

2018-01-18 Thread Yash Sharma
Please send mail to user-unsubscr...@spark.apache.org to unsubscribe.

Cheers

On Fri., 19 Jan. 2018, 5:11 pm Anu B Nair,  wrote:

>


Re: Unsubscribe

2018-01-18 Thread Yash Sharma
Please send mail to user-unsubscr...@spark.apache.org to unsubscribe.

Cheers

On Fri., 19 Jan. 2018, 5:28 pm Sbf xyz,  wrote:

>


Re: Quick one... AWS SDK version?

2017-10-03 Thread Yash Sharma
Hi JG,
Here are my cluster configs, in case they help.

Cheers.

EMR: emr-5.8.0
Hadoop distribution: Amazon 2.7.3
AWS sdk: /usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.160.jar

Applications:
Hive 2.3.0
Spark 2.2.0
Tez 0.8.4
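
If you want to double-check which SDK jar actually gets picked up at runtime, a
quick sketch like this from spark-shell (the class name is just one that ships
in aws-java-sdk-core) prints the jar it was loaded from:

{code}
// Prints the jar providing the AWS SDK core classes, to confirm whether the
// EMR-provided aws-java-sdk-1.11.160.jar or something bundled in your own
// assembly ends up on the driver classpath.
val sdkClass = Class.forName("com.amazonaws.AmazonWebServiceClient")
val source = Option(sdkClass.getProtectionDomain.getCodeSource)
println(source.map(_.getLocation.toString).getOrElse("loaded from the boot classpath"))
{code}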


On Tue, 3 Oct 2017 at 12:29 JG Perrin  wrote:

> Hey Sparkians,
>
>
>
> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with
> the Hadoop 2.7.3 libs?
>
>
>
> Thanks!
>
>
>
> jg
>


Re: Error while reading the CSV

2017-04-07 Thread Yash Sharma
Sorry buddy, I didn't quite get your question right.
Just to test, I created a Scala class with spark-csv and it seemed to work.

Not sure if that helps much, but here are the env details:
EMR 2.7.3
scalaVersion := "2.11.8"
Spark version 2.0.2
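
For comparison, a plain Spark 2.x read like the one below works out of the box
(the path is a placeholder), since the CSV source is built in on 2.x and no
external package is needed:

{code}
import org.apache.spark.sql.SparkSession

// Minimal sketch: read a headered CSV with the built-in Spark 2.x CSV source.
val spark = SparkSession.builder().appName("csv-read-test").getOrCreate()
val df = spark.read
  .option("header", "true")
  .csv("timeline.csv")
df.printSchema()
{code}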



On Fri, 7 Apr 2017 at 17:51 nayan sharma <nayansharm...@gmail.com> wrote:

> Hi Yash,
> I know this will work perfect but here I wanted  to read the csv using the
> assembly jar file.
>
> Thanks,
> Nayan
>
> On 07-Apr-2017, at 10:02 AM, Yash Sharma <yash...@gmail.com> wrote:
>
> Hi Nayan,
> I use the --packages with the spark shell and the spark submit. Could you
> please try that and let us know:
> Command:
>
> spark-submit --packages com.databricks:spark-csv_2.11:1.4.0
>
>
> On Fri, 7 Apr 2017 at 00:39 nayan sharma <nayansharm...@gmail.com> wrote:
>
> spark version 1.6.2
> scala version 2.10.5
>
> On 06-Apr-2017, at 8:05 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> And which version does your Spark cluster use?
>
> On 6. Apr 2017, at 16:11, nayan sharma <nayansharm...@gmail.com> wrote:
>
> scalaVersion := “2.10.5"
>
>
>
>
>
> On 06-Apr-2017, at 7:35 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or
> the other way around?
>
> On 6. Apr 2017, at 15:54, nayan sharma <nayansharm...@gmail.com> wrote:
>
> In addition I am using spark version 1.6.2
> Is there any chance of error coming because of Scala version or
> dependencies are not matching.?I just guessed.
>
> Thanks,
> Nayan
>
>
>
> On 06-Apr-2017, at 7:16 PM, nayan sharma <nayansharm...@gmail.com> wrote:
>
> Hi Jorn,
> Thanks for replying.
>
> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>
> after doing this I have found a lot of classes under
> com/databricks/spark/csv/
>
> do I need to check for any specific class ??
>
> Regards,
> Nayan
>
> On 06-Apr-2017, at 6:42 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Is the library in your assembly jar?
>
> On 6. Apr 2017, at 15:06, nayan sharma <nayansharm...@gmail.com> wrote:
>
> Hi All,
> I am getting error while loading CSV file.
>
> val
> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("timeline.csv")
> java.lang.NoSuchMethodError:
> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>
>
> I have added the dependencies in sbt file
>
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
>
> *and starting the spark-shell with command*
>
> spark-shell --master yarn-client  --jars
> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
> --name nayan
>
>
>
> Thanks for any help!!
>
>
> Thanks,
> Nayan
>
>
>
>
>
>
>


Re: distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-06 Thread Yash Sharma
Hi Ramesh,
Could you share some logs please - a pastebin or the DAG view?
Did you also check for GC pauses, if any?
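
If you want to rule out GC first, one option (the flags below are just an
example) is to turn on GC logging for the executors and rerun the distinct:

{code}
import org.apache.spark.SparkConf

// Example only: surface GC activity in the executor logs so long pauses show
// up next to the ShuffleBlockFetcherIterator messages.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
{code}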

On Thu, 6 Apr 2017 at 21:55 Ramesh Krishnan  wrote:

> I have a use case of distinct on a dataframe. When i run the application
> is getting stuck at  LINE *ShuffleBlockFetcherIterator: Started 4 remote
> fetches *forever.
>
> Can someone help .
>
>
> Thanks
> Ramesh
>


Re: Error while reading the CSV

2017-04-06 Thread Yash Sharma
Hi Nayan,
I use --packages with both spark-shell and spark-submit. Could you
please try that and let us know:
Command:

spark-submit --packages com.databricks:spark-csv_2.11:1.4.0
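
If the assembly route keeps failing with the NoSuchMethodError on
CSVFormat.withQuote, it is usually an older commons-csv on the cluster
classpath (one that predates withQuote) shadowing the version spark-csv needs.
A quick check from spark-shell - just a sketch - is to print where that class
is loaded from:

{code}
// Prints the jar that provides org.apache.commons.csv.CSVFormat; if it points
// at an old commons-csv, that explains the missing withQuote method.
val src = Option(Class.forName("org.apache.commons.csv.CSVFormat")
  .getProtectionDomain.getCodeSource)
println(src.map(_.getLocation.toString).getOrElse("loaded from the boot classpath"))
{code}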


On Fri, 7 Apr 2017 at 00:39 nayan sharma  wrote:

> spark version 1.6.2
> scala version 2.10.5
>
> On 06-Apr-2017, at 8:05 PM, Jörn Franke  wrote:
>
> And which version does your Spark cluster use?
>
> On 6. Apr 2017, at 16:11, nayan sharma  wrote:
>
> scalaVersion := “2.10.5"
>
>
>
>
>
> On 06-Apr-2017, at 7:35 PM, Jörn Franke  wrote:
>
> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or
> the other way around?
>
> On 6. Apr 2017, at 15:54, nayan sharma  wrote:
>
> In addition I am using spark version 1.6.2
> Is there any chance of error coming because of Scala version or
> dependencies are not matching.?I just guessed.
>
> Thanks,
> Nayan
>
>
>
> On 06-Apr-2017, at 7:16 PM, nayan sharma  wrote:
>
> Hi Jorn,
> Thanks for replying.
>
> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>
> after doing this I have found a lot of classes under
> com/databricks/spark/csv/
>
> do I need to check for any specific class ??
>
> Regards,
> Nayan
>
> On 06-Apr-2017, at 6:42 PM, Jörn Franke  wrote:
>
> Is the library in your assembly jar?
>
> On 6. Apr 2017, at 15:06, nayan sharma  wrote:
>
> Hi All,
> I am getting error while loading CSV file.
>
> val
> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("timeline.csv")
> java.lang.NoSuchMethodError:
> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>
>
> I have added the dependencies in sbt file
>
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
>
> *and starting the spark-shell with command*
>
> spark-shell --master yarn-client  --jars
> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
> --name nayan
>
>
>
> Thanks for any help!!
>
>
> Thanks,
> Nayan
>
>
>
>
>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Yash Sharma
Hi Shyla,
We could suggest better options based on what exactly you're trying to do. But
with the given information - if you have your Spark job ready, you could
schedule it via any scheduling framework like Airflow, Celery or cron,
depending on how simple or complex you want your workflow to be.

Cheers,
Yash



On Fri, 7 Apr 2017 at 10:04 shyla deshpande 
wrote:

> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>


Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread Yash Sharma
We have too many (large) files. We have about 30k partitions with about 4
years' worth of data, and we need to process the entire history in a one-time
monolithic job.

I would like to know how Spark decides the number of executors requested.
I've seen test cases where the max executor count is Integer's max value, and
was wondering if we can compute an appropriate max executor count based on
the cluster resources.

Would be happy to contribute back if I can get some info on the executor
requests.
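
As a starting point, a rough cap can be derived from the cluster itself rather
than from the pending task backlog. All the numbers below are assumptions for
illustration only:

{code}
import org.apache.spark.SparkConf

// Back-of-envelope sizing: cap dynamic allocation at what the cluster can
// actually host instead of the backlog-driven 168510 requests.
val nodes            = 12   // EMR core nodes (assumed)
val coresPerNode     = 16   // assumed
val coresPerExecutor = 2
val maxExecutors     = nodes * coresPerNode / coresPerExecutor  // 96

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", maxExecutors.toString)
{code}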

Cheers


On Sat, Sep 24, 2016, 6:39 PM ayan guha <guha.a...@gmail.com> wrote:

> Do you have too many small files you are trying to read? Number of
> executors are very high
> On 24 Sep 2016 10:28, "Yash Sharma" <yash...@gmail.com> wrote:
>
>> Have been playing around with configs to crack this. Adding them here
>> where it would be helpful to others :)
>> Number of executors and timeout seemed like the core issue.
>>
>> {code}
>> --driver-memory 4G \
>> --conf spark.dynamicAllocation.enabled=true \
>> --conf spark.dynamicAllocation.maxExecutors=500 \
>> --conf spark.core.connection.ack.wait.timeout=6000 \
>> --conf spark.akka.heartbeat.interval=6000 \
>> --conf spark.akka.frameSize=100 \
>> --conf spark.akka.timeout=6000 \
>> {code}
>>
>> Cheers !
>>
>> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in>
>> wrote:
>>
>>> For testing purpose can you run with fix number of executors and try.
>>> May be 12 executors for testing and let know the status.
>>>
>>> Get Outlook for Android <https://aka.ms/ghei36>
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
>>> wrote:
>>>
>>> Thanks Aditya, appreciate the help.
>>>>
>>>> I had the exact thought about the huge number of executors requested.
>>>> I am going with the dynamic executors and not specifying the number of
>>>> executors. Are you suggesting that I should limit the number of executors
>>>> when the dynamic allocator requests for more number of executors.
>>>>
>>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <
>>>> aditya.calangut...@augmentiq.co.in> wrote:
>>>>
>>>>> Hi Yash,
>>>>>
>>>>> What is your total cluster memory and number of cores?
>>>>> Problem might be with the number of executors you are allocating. The
>>>>> logs shows it as 168510 which is on very high side. Try reducing your
>>>>> executors.
>>>>>
>>>>>
>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>
>>>>>> Hi All,
>>>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>>>> allocation enabled.
>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>> starts*.
>>>>>>
>>>>>> Is there anything I can check to debug this problem. There is not a
>>>>>> lot of information in logs for the exact cause but here is some snapshot
>>>>>> below.
>>>>>>
>>>>>> Thanks All.
>>>>>>
>>>>>> * - by starts I mean when it shows something on the spark web ui,
>>>>>> before that its just blank page.
>>>>>>
>>>>>> Logs here -
>>>>>>
>>>>>> {code}
>>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>>>> of 168510 executor(s).
>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB 
>>>>>> overhead
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 22
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 19
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 18
>>>>>> 

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Hi Dhruve, thanks.
I've solved the issue by adding a cap on max executors.
I wanted to find some place where I could add this behavior in Spark so that
users don't have to worry about the max executor count.

Cheers

- Thanks, via mobile,  excuse brevity.

On Sep 24, 2016 1:15 PM, "dhruve ashar" <dhruveas...@gmail.com> wrote:

> From your log, its trying to launch every executor with approximately
> 6.6GB of memory. 168510 is an extremely huge no. executors and 168510 x
> 6.6GB is unrealistic for a 12 node cluster.
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>
> I don't know the size of the data that you are processing here.
>
> Here are some general choices that I would start with.
>
> Start with a smaller no. of minimum executors and assign them reasonable
> memory. This can be around 48 assuming 12 nodes x 4 cores each. You could
> start with processing a subset of your data and see if you are able to get
> a decent performance. Then gradually increase the maximum # of execs for
> dynamic allocation and process the remaining data.
>
>
>
>
> On Fri, Sep 23, 2016 at 7:54 PM, Yash Sharma <yash...@gmail.com> wrote:
>
>> Is there anywhere I can help fix this ?
>>
>> I can see the requests being made in the yarn allocator. What should be
>> the upperlimit of the requests made ?
>>
>> https://github.com/apache/spark/blob/master/yarn/src/main/
>> scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
>>
>> On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <yash...@gmail.com> wrote:
>>
>>> Have been playing around with configs to crack this. Adding them here
>>> where it would be helpful to others :)
>>> Number of executors and timeout seemed like the core issue.
>>>
>>> {code}
>>> --driver-memory 4G \
>>> --conf spark.dynamicAllocation.enabled=true \
>>> --conf spark.dynamicAllocation.maxExecutors=500 \
>>> --conf spark.core.connection.ack.wait.timeout=6000 \
>>> --conf spark.akka.heartbeat.interval=6000 \
>>> --conf spark.akka.frameSize=100 \
>>> --conf spark.akka.timeout=6000 \
>>> {code}
>>>
>>> Cheers !
>>>
>>> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in>
>>> wrote:
>>>
>>>> For testing purpose can you run with fix number of executors and try.
>>>> May be 12 executors for testing and let know the status.
>>>>
>>>> Get Outlook for Android <https://aka.ms/ghei36>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com
>>>> > wrote:
>>>>
>>>> Thanks Aditya, appreciate the help.
>>>>>
>>>>> I had the exact thought about the huge number of executors requested.
>>>>> I am going with the dynamic executors and not specifying the number of
>>>>> executors. Are you suggesting that I should limit the number of executors
>>>>> when the dynamic allocator requests for more number of executors.
>>>>>
>>>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>>>>> co.in> wrote:
>>>>>
>>>>>> Hi Yash,
>>>>>>
>>>>>> What is your total cluster memory and number of cores?
>>>>>> Problem might be with the number of executors you are allocating. The
>>>>>> logs shows it as 168510 which is on very high side. Try reducing your
>>>>>> executors.
>>>>>>
>>>>>>
>>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>>>>> allocation enabled.
>>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>>> starts*.
>>>>>>>
>>>>>>> Is there anything I can check to debug this problem. There is not a
>>>>>>> lot of information in logs for the exact cause but here is some snapshot
>>>>>>> below.
>>>>>>>
>>>>>>> Thanks All.
>>>>>>>
>>>>>>> * - by s

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Is there anywhere I can help fix this?

I can see the requests being made in the YARN allocator. What should be the
upper limit of the requests made?

https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222

On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <yash...@gmail.com> wrote:

> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in>
> wrote:
>
>> For testing purpose can you run with fix number of executors and try. May
>> be 12 executors for testing and let know the status.
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>>
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
>> wrote:
>>
>> Thanks Aditya, appreciate the help.
>>>
>>> I had the exact thought about the huge number of executors requested.
>>> I am going with the dynamic executors and not specifying the number of
>>> executors. Are you suggesting that I should limit the number of executors
>>> when the dynamic allocator requests for more number of executors.
>>>
>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>>> co.in> wrote:
>>>
>>>> Hi Yash,
>>>>
>>>> What is your total cluster memory and number of cores?
>>>> Problem might be with the number of executors you are allocating. The
>>>> logs shows it as 168510 which is on very high side. Try reducing your
>>>> executors.
>>>>
>>>>
>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>
>>>>> Hi All,
>>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>>> allocation enabled.
>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>> starts*.
>>>>>
>>>>> Is there anything I can check to debug this problem. There is not a
>>>>> lot of information in logs for the exact cause but here is some snapshot
>>>>> below.
>>>>>
>>>>> Thanks All.
>>>>>
>>>>> * - by starts I mean when it shows something on the spark web ui,
>>>>> before that its just blank page.
>>>>>
>>>>> Logs here -
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>>> of 168510 executor(s).
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 22
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 19
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 18
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 12
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 11
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 20
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 15
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 7
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Have been playing around with configs to crack this. Adding them here in case
they're helpful to others :)
The number of executors and the timeouts seemed like the core issue.

{code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.core.connection.ack.wait.timeout=6000 \
--conf spark.akka.heartbeat.interval=6000 \
--conf spark.akka.frameSize=100 \
--conf spark.akka.timeout=6000 \
{code}

Cheers !

On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in> wrote:

> For testing purpose can you run with fix number of executors and try. May
> be 12 executors for testing and let know the status.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact thought about the huge number of executors requested.
>> I am going with the dynamic executors and not specifying the number of
>> executors. Are you suggesting that I should limit the number of executors
>> when the dynamic allocator requests for more number of executors.
>>
>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>
>>
>>
>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>> co.in> wrote:
>>
>>> Hi Yash,
>>>
>>> What is your total cluster memory and number of cores?
>>> Problem might be with the number of executors you are allocating. The
>>> logs shows it as 168510 which is on very high side. Try reducing your
>>> executors.
>>>
>>>
>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>
>>>> Hi All,
>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>> allocation enabled.
>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>> starts*.
>>>>
>>>> Is there anything I can check to debug this problem. There is not a lot
>>>> of information in logs for the exact cause but here is some snapshot below.
>>>>
>>>> Thanks All.
>>>>
>>>> * - by starts I mean when it shows something on the spark web ui,
>>>> before that its just blank page.
>>>>
>>>> Logs here -
>>>>
>>>> {code}
>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>> of 168510 executor(s).
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 22
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 19
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 18
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 12
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 11
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 20
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 15
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 7
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 8
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 16
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 21
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 6
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 13
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 14
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Yash Sharma
Correction -
dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")
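
For completeness, a sketch of the delete-then-append flow (bucket, prefix and
dataDF are placeholders following the example above, and it assumes an AWS SDK
1.11+ client; note listObjects returns at most 1000 keys per call, so a real
job would page through the listing):

{code}
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.spark.sql.SaveMode
import scala.collection.JavaConverters._

// 1) Wipe the partition being rewritten via the S3 API.
val s3     = AmazonS3ClientBuilder.defaultClient()
val bucket = "data"
val prefix = "test2/events/year=2016/month=07/date=25/"
s3.listObjects(bucket, prefix).getObjectSummaries.asScala
  .foreach(obj => s3.deleteObject(bucket, obj.getKey))

// 2) Re-write just that partition's data in Append mode.
dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Append)
  .text("s3://data/test2/events/")
{code}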

On Tue, Jul 26, 2016 at 10:59 AM, Yash Sharma <yash...@gmail.com> wrote:

> Based on the behavior of spark [1], Overwrite mode will delete all your
> data when you try to overwrite a particular partition.
>
> What I did-
> - Use S3 api to delete all partitions
> - Use spark df to write in Append mode [2]
>
>
> 1.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html
>
> 2. dataDF.write.partitionBy(“year”, “month”,
> “date”).mode(SaveMode.Overwrite).text(“s3://data/test2/events/”)
>
> On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
> wrote:
>
>> Probably should have been more specific with the code we are using, which
>> is something like
>>
>> val df = 
>> df.write.mode("append or overwrite
>> here").partitionBy("date").saveAsTable("my_table")
>>
>> Unless there is something like what I described on the native API, I will
>> probably take the approach of having a S3 API call to wipe out that
>> partition before the job starts, but it would be nice to not have to
>> incorporate another step in the job.
>>
>> Pedro
>>
>> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkad...@collectivei.com>
>> wrote:
>>
>>> You can have a temporary file to capture the data that you would like to
>>> overwrite. And swap that with existing partition that you would want to
>>> wipe the data away. Swapping can be done by simple rename of the partition
>>> and just repair the table to pick up the new partition.
>>>
>>> Am not sure if that addresses your scenario.
>>>
>>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
>>> wrote:
>>>
>>> What would be the best way to accomplish the following behavior:
>>>
>>> 1. There is a table which is partitioned by date
>>> 2. Spark job runs on a particular date, we would like it to wipe out all
>>> data for that date. This is to make the job idempotent and lets us rerun a
>>> job if it failed without fear of duplicated data
>>> 3. Preserve data for all other dates
>>>
>>> I am guessing that overwrite would not work here or if it does its not
>>> guaranteed to stay that way, but am not sure. If thats the case, is there a
>>> good/robust way to get this behavior?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>>
>>> Collective[i] dramatically improves sales and marketing performance
>>> using technology, applications and a revolutionary network designed to
>>> provide next generation analytics and decision-support directly to business
>>> users. Our goal is to maximize human potential and minimize mistakes. In
>>> most cases, the results are astounding. We cannot, however, stop emails
>>> from sometimes being sent to the wrong person. If you are not the intended
>>> recipient, please notify us by replying to this email's sender and deleting
>>> it (and any attachments) permanently from your system. If you are, please
>>> respect the confidentiality of this communication's contents.
>>
>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Yash Sharma
Based on the behavior of Spark [1], Overwrite mode will delete all your
data when you try to overwrite a particular partition.

What I did -
- Use the S3 API to delete the affected partitions
- Use the Spark DataFrame writer in Append mode [2]


1.
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html

2. dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")

On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez 
wrote:

> Probably should have been more specific with the code we are using, which
> is something like
>
> val df = 
> df.write.mode("append or overwrite
> here").partitionBy("date").saveAsTable("my_table")
>
> Unless there is something like what I described on the native API, I will
> probably take the approach of having a S3 API call to wipe out that
> partition before the job starts, but it would be nice to not have to
> incorporate another step in the job.
>
> Pedro
>
> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri  wrote:
>
>> You can have a temporary file to capture the data that you would like to
>> overwrite. And swap that with existing partition that you would want to
>> wipe the data away. Swapping can be done by simple rename of the partition
>> and just repair the table to pick up the new partition.
>>
>> Am not sure if that addresses your scenario.
>>
>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez 
>> wrote:
>>
>> What would be the best way to accomplish the following behavior:
>>
>> 1. There is a table which is partitioned by date
>> 2. Spark job runs on a particular date, we would like it to wipe out all
>> data for that date. This is to make the job idempotent and lets us rerun a
>> job if it failed without fear of duplicated data
>> 3. Preserve data for all other dates
>>
>> I am guessing that overwrite would not work here or if it does its not
>> guaranteed to stay that way, but am not sure. If thats the case, is there a
>> good/robust way to get this behavior?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>>
>> Collective[i] dramatically improves sales and marketing performance using
>> technology, applications and a revolutionary network designed to provide
>> next generation analytics and decision-support directly to business users.
>> Our goal is to maximize human potential and minimize mistakes. In most
>> cases, the results are astounding. We cannot, however, stop emails from
>> sometimes being sent to the wrong person. If you are not the intended
>> recipient, please notify us by replying to this email's sender and deleting
>> it (and any attachments) permanently from your system. If you are, please
>> respect the confidentiality of this communication's contents.
>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Streaming from Kinesis is not getting data in Yarn cluster

2016-07-15 Thread Yash Sharma
I struggled with Kinesis for a long time and documented all my findings
at -

http://stackoverflow.com/questions/35567440/spark-not-able-to-fetch-events-from-amazon-kinesis
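
A typical receiver-based setup, just as a sketch (app name, stream name, region
and shard count below are placeholders), looks like this - the important part
being that each Kinesis receiver occupies an executor core, so the job needs
more cores than receivers:

{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// One receiver per shard, unioned into a single DStream.
val ssc       = new StreamingContext(sc, Seconds(10))
val numShards = 2  // assumed
val streams = (1 to numShards).map { _ =>
  KinesisUtils.createStream(ssc, "my-kcl-app", "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
}
val unioned = ssc.union(streams)
{code}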

Let me know if it helps.

Cheers,
Yash

- Thanks, via mobile,  excuse brevity.

On Jul 16, 2016 6:05 AM, "dharmendra"  wrote:

> I have created small spark streaming program to fetch data from Kinesis and
> put some data in database.
> When i ran it in spark standalone cluster using master as local[*] it is
> working fine but when i tried to run in yarn cluster with master as "yarn"
> application doesn't receive any data.
>
> I submit job using following command
> spark-submit --class <> --master yarn --deploy-mode cluster
> --queue default --executor-cores 2 --executor-memory 2G --num-executors 4
> <
>
> My java code is like
>
> JavaDStream enrichStream =
> javaDStream.flatMap(sparkRecordProcessor);
>
> enrichStream.mapToPair(new PairFunction Integer>()
> {
> @Override
> public Tuple2 call(Aggregation s) throws
> Exception {
> LOGGER.info("creating tuple " + s);
> return new Tuple2<>(s, 1);
> }
> }).reduceByKey(new Function2() {
> @Override
> public Integer call(Integer i1, Integer i2) throws Exception {
> LOGGER.info("reduce by key {}, {}", i1, i2);
> return i1 + i2;
> }
> }).foreach(sparkDatabaseProcessor);
>
>
> I have put some logs in sparkRecordProcessor and sparkDatabaseProcessor.
> I can see that sparkDatabaseProcessor executed every batch interval(10 sec)
> and but find no log in sparkRecordProcessor.
> There is no event(avg/sec) in Spark Streaming UI.
> In Executor tab i can see 3 executors. Data against these executors are
> also
> continuously updated.
> I also check Dynamodb table in Amazon and leaseCounter is updated regularly
> from my application.
> But spark streaming gets no data from Kinesis in yarn.
> I see "shuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0
> blocks" many times in log.
> I don't know what else i need to do to run spark streaming on yarn.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-from-Kinesis-is-not-getting-data-in-Yarn-cluster-tp27345.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Error in Spark job

2016-07-12 Thread Yash Sharma
Looks like the write to Aerospike is taking too long.

Could you try writing the RDD directly to the filesystem, skipping the
Aerospike write?

foreachPartition at WriteToAerospike.java:47, took 338.345827 s
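
A minimal way to isolate that (the output path is just an example, and rdd
stands for whatever is currently passed to foreachPartition):

{code}
// Write the same RDD straight to the filesystem; if this finishes quickly, the
// time is going into the Aerospike writes rather than the upstream shuffle.
rdd.saveAsTextFile("hdfs:///tmp/aerospike-debug-output")
{code}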

- Thanks, via mobile,  excuse brevity.
On Jul 12, 2016 8:08 PM, "Saurav Sinha"  wrote:

> Hi,
>
> I am getting into an issue where job is running in multiple partition
> around 21000 parts.
>
>
> Setting
>
> Driver = 5G
> Executor memory = 10G
> Total executor core =32
> It us falling when I am trying to write to aerospace earlier it is working
> fine. I am suspecting number of partition as reason.
>
> Kindly help to solve this.
>
> It is giving error :
>
>
> 16/07/12 14:53:54 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 37 is 9436142 bytes
> 16/07/12 14:58:46 WARN HeartbeatReceiver: Removing executor 0 with no
> recent heartbeats: 150060 ms exceeds timeout 12 ms
> 16/07/12 14:58:48 WARN DAGScheduler: Creating new stage failed due to
> exception - job: 14
> java.lang.IllegalStateException: unread block data
> at
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$2.apply(MapOutputTracker.scala:371)
> at
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$2.apply(MapOutputTracker.scala:371)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
> at
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:372)
> at
> org.apache.spark.scheduler.DAGScheduler.newOrUsedShuffleStage(DAGScheduler.scala:292)
> at
> org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:343)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:221)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:324)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:321)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:321)
> at
> org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:333)
> at
> org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:234)
> at
> org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:270)
> at
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:768)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1426)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 16/07/12 14:58:48 ERROR TaskSchedulerImpl: Lost executor 0 : Executor
> heartbeat timed out after 150060 ms
> 16/07/12 14:58:48 INFO DAGScheduler: Job 14 failed: foreachPartition at
> WriteToAerospike.java:47, took 338.345827 s
> 16/07/12 14:58:48 ERROR MinervaLauncher: Job failed due to exception
> =java.lang.IllegalStateException: unread block data
> java.lang.IllegalStateException: unread block data
> at
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$2.apply(MapOutputTracker.scala:371)
> at
> org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$2.apply(MapOutputTracker.scala:371)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
> at
> org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:372)
> at
> org.apache.spark.scheduler.DAGScheduler.newOrUsedShuffleStage(DAGScheduler.scala:292)
> at
> org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:343)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:221)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:324)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:321)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:321)
> at
> 

Re: Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Yash Sharma
This answers exactly what you are looking for -

http://stackoverflow.com/a/34204640/1562474
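
In short, the usual pattern (sketched below with the Spark 2.x DataFrame API;
on 1.x collect_list/collect_set need a HiveContext) is to explode the array
column and re-aggregate:

{code}
import org.apache.spark.sql.functions._

// Explode words into one row per (id, word), then collect back per id:
// collect_list keeps duplicates, collect_set gives the distinct words.
val merged = df
  .withColumn("word", explode(col("words")))
  .groupBy("id")
  .agg(
    collect_list("word").as("all_words"),
    collect_set("word").as("distinct_words")
  )
{code}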

On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez 
wrote:

> Is it possible with Spark SQL to merge columns whose types are Arrays or
> Sets?
>
> My use case would be something like this:
>
> DF types
> id: String
> words: Array[String]
>
> I would want to do something like
>
> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Yash Sharma
Spark is more of an execution engine than a database. Hive is a data
warehouse, but I still like treating it as an execution engine.

For databases, you could compare HBase and Cassandra, as they both have very
wide usage and proven performance. We have used Cassandra in the past and
were very happy with the results. You should move this discussion to
Cassandra's/HBase's mailing list for better advice.

Cheers

On Tue, Jul 12, 2016 at 3:23 PM, ayan guha  wrote:

> HI
>
> HBase is pretty neat itself. But speed is not the criteria to choose Hbase
> over Cassandra (or vicey versa).. Slowness can very well because of design
> issues, and unfortunately it will not help changing technology in that case
> :)
>
> I would suggest you to quantify "slow"-ness in conjunction
> with infrastructure you have and I am sure good people here will help.
>
> Best
> Ayan
>
> On Tue, Jul 12, 2016 at 3:01 PM, Ashok Kumar  > wrote:
>
>> Anyone in Spark as well
>>
>> My colleague has been using Cassandra. However, he says it is too slow
>> and not user friendly/
>> MongodDB as a doc databases is pretty neat but not fast enough
>>
>> May main concern is fast writes per second and good scaling.
>>
>>
>> Hive on Spark or Tez?
>>
>> How about Hbase. or anything else
>>
>> Any expert advice warmly acknowledged..
>>
>> thanking yo
>>
>>
>> On Monday, 11 July 2016, 17:24, Ashok Kumar  wrote:
>>
>>
>> Hi Gurus,
>>
>> Advice appreciated from Hive gurus.
>>
>> My colleague has been using Cassandra. However, he says it is too slow
>> and not user friendly/
>> MongodDB as a doc databases is pretty neat but not fast enough
>>
>> May main concern is fast writes per second and good scaling.
>>
>>
>> Hive on Spark or Tez?
>>
>> How about Hbase. or anything else
>>
>> Any expert advice warmly acknowledged..
>>
>> thanking you
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark cluster tuning recommendation

2016-07-11 Thread Yash Sharma
I would say use dynamic allocation rather than a fixed number of executors,
and provide whatever executor memory you would like.
Deciding the values requires a couple of test runs and checking what works
best for you.

You could try something like -

--driver-memory 1G \
--executor-memory 2G \
--executor-cores 2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=8 \
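
For the partition count / parallelism questions, a common rule of thumb (the
numbers below are only an example) is to aim for roughly 2-3 tasks per
available core:

{code}
// Example sizing only: with 8 initial executors at 2 cores each, target about
// 2-3x that many partitions so every core stays busy without tiny tasks.
val executors         = 8
val coresPerExecutor  = 2
val targetParallelism = executors * coresPerExecutor * 3   // 48
// Pass this as --conf spark.default.parallelism=48, or repartition(48) the RDD.
{code}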



On Tue, Jul 12, 2016 at 1:27 PM, Anuj Kumar  wrote:

> That configuration looks bad. With only two cores in use and 1GB used by
> the app. Few points-
>
> 1. Please oversubscribe those CPUs to at-least twice the amount of cores
> you have to start-with and then tune if it freezes
> 2. Allocate all of the CPU cores and memory to your running app (I assume
> it is your test environment)
> 3. Assuming that you are running a quad core machine if you define cores
> as 8 for your workers you will get 56 cores (CPU threads)
> 4. Also, it depends on the source from where you are reading the data. If
> you are reading from HDFS, what is your block size and part count?
> 5. You may also have to tune the timeouts and frame-size based on the
> dataset and errors that you are facing
>
> We have run terasort with couple of high-end worker machines RW from HDFS
> with 5-10 mount points allocated for HDFS and Spark local. We have used
> multiple configuration, like-
> 10W-10CPU-10GB, 25W-6CPU-6GB running on each of the two machines with HDFS
> 512MB blocks and 1000-2000 parts. All these guys chatting at 10Gbe, worked
> well.
>
> On Tue, Jul 12, 2016 at 3:39 AM, Kartik Mathur 
> wrote:
>
>> I am trying a run terasort in spark , for a 7 node cluster with only 10g
>> of data and executors get lost with GC overhead limit exceeded error.
>>
>> This is what my cluster looks like -
>>
>>
>>- *Alive Workers:* 7
>>- *Cores in use:* 28 Total, 2 Used
>>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used
>>- *Applications:* 1 Running, 6 Completed
>>- *Drivers:* 0 Running, 0 Completed
>>- *Status:* ALIVE
>>
>> Each worker has 8 cores and 4GB memory.
>>
>> My questions is how do people running in production decide these
>> properties -
>>
>> 1) --num-executors
>> 2) --executor-cores
>> 3) --executor-memory
>> 4) num of partitions
>> 5) spark.default.parallelism
>>
>> Thanks,
>> Kartik
>>
>>
>>
>


Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
I cannot get a lot of info from these logs, but it surely seems like a YARN
setup issue. Did you try local mode to check if it works -

./bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master local[4] \
> spark-examples-1.6.1-hadoop2.6.0.jar 10


Note - the jar is a local one

On Wed, Jun 22, 2016 at 4:50 PM, 另一片天 <958943...@qq.com> wrote:

>  Application application_1466568126079_0006 failed 2 times due to AM
> Container for appattempt_1466568126079_0006_02 exited with exitCode: 1
> For more detailed output, check application tracking page:
> http://master:8088/proxy/application_1466568126079_0006/Then, click on
> links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_1466568126079_0006_02_01
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> Failing this attempt. Failing the application.
>
> but command get error
>
> shihj@master:~/workspace/hadoop-2.6.4$ yarn logs -applicationId
> application_1466568126079_0006
> Usage: yarn [options]
>
> yarn: error: no such option: -a
>
>
>
> -- Original Message --
> *From:* "Yash Sharma";<yash...@gmail.com>;
> *Sent:* Wednesday, 22 June 2016, 2:46 PM
> *To:* "另一片天"<958943...@qq.com>;
> *Cc:* "Saisai Shao"<sai.sai.s...@gmail.com>; "user"<user@spark.apache.org>;
>
> *Subject:* Re: Could not find or load main class
> org.apache.spark.deploy.yarn.ExecutorLauncher
>
> Are you able to run anything else on the cluster, I suspect its yarn that
> not able to run the class. If you could just share the logs in pastebin we
> could confirm that.
>
> On Wed, Jun 22, 2016 at 4:43 PM, 另一片天 <958943...@qq.com> wrote:
>
>> i  want to avoid Uploading resource file (especially jar package),because
>> them very big,the application will wait for too long,there are good method??
>> so i config that para, but not get the my want to effect。
>>
>>
>> -- Original Message --
>> *From:* "Yash Sharma";<yash...@gmail.com>;
>> *Sent:* Wednesday, 22 June 2016, 2:34 PM
>> *To:* "另一片天"<958943...@qq.com>;
>> *Cc:* "Saisai Shao"<sai.sai.s...@gmail.com>; "user"<user@spark.apache.org>;
>>
>> *Subject:* Re: Could not find or load main class
>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>
>> Try with : --master yarn-cluster
>>
>> On Wed, Jun 22, 2016 at 4:30 PM, 另一片天 <958943...@qq.com> wrote:
>>
>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>>> --executor-cores 2
>>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>> 10
>>> Warning: Skip remote jar
>>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar.
>>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:348)
>>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>> at org.apache.s

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Are you able to run anything else on the cluster? I suspect it's YARN that's
not able to run the class. If you could just share the logs in a pastebin, we
could confirm that.

On Wed, Jun 22, 2016 at 4:43 PM, 另一片天 <958943...@qq.com> wrote:

> i  want to avoid Uploading resource file (especially jar package),because
> them very big,the application will wait for too long,there are good method??
> so i config that para, but not get the my want to effect。
>
>
> -- Original Message --
> *From:* "Yash Sharma";<yash...@gmail.com>;
> *Sent:* Wednesday, 22 June 2016, 2:34 PM
> *To:* "另一片天"<958943...@qq.com>;
> *Cc:* "Saisai Shao"<sai.sai.s...@gmail.com>; "user"<user@spark.apache.org>;
>
> *Subject:* Re: Could not find or load main class
> org.apache.spark.deploy.yarn.ExecutorLauncher
>
> Try with : --master yarn-cluster
>
> On Wed, Jun 22, 2016 at 4:30 PM, 另一片天 <958943...@qq.com> wrote:
>
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>> --executor-cores 2
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>> 10
>> Warning: Skip remote jar
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar.
>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>> -- Original Message --
>> *From:* "Yash Sharma";<yash...@gmail.com>;
>> *Sent:* Wednesday, 22 June 2016, 2:28 PM
>> *To:* "另一片天"<958943...@qq.com>;
>> *Cc:* "Saisai Shao"<sai.sai.s...@gmail.com>; "user"<user@spark.apache.org>;
>>
>> *Subject:* Re: Could not find or load main class
>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>
>> Or better , try the master as yarn-cluster,
>>
>> ./bin/spark-submit \
>> --class org.apache.spark.examples.SparkPi \
>> --master yarn-cluster \
>> --driver-memory 512m \
>> --num-executors 2 \
>> --executor-memory 512m \
>> --executor-cores 2 \
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.
>> 1-hadoop2.6.0.jar
>>
>> On Wed, Jun 22, 2016 at 4:27 PM, 另一片天 <958943...@qq.com> wrote:
>>
>>> Is it able to run on local mode ?
>>>
>>> what mean?? standalone mode ?
>>>
>>>
>>> -- Original Message --
>>> *From:* "Yash Sharma";<yash...@gmail.com>;
>>> *Sent:* Wednesday, 22 June 2016, 2:18 PM
>>> *To:* "Saisai Shao"<sai.sai.s...@gmail.com>;
>>> *Cc:* "另一片天"<958943...@qq.com>; "user"<user@spark.apache.org>;
>>> *Subject:* Re: Could not find or load main class
>>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>>
>>> Try providing the jar with the hdfs prefix. Its probably just because
>>> its not able to find the jar on all nodes.
>>>
>>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.
>>> 1-hadoop2.6.0.jar
>>>
>>> Is it able to run on local mode ?
>>>
>>> On Wed, Jun 22, 2016 at 4:14 PM, Saisai Shao <sai.sai.s...@gmail.com>
>>> wrote:
>>>
>>>> spark.yarn.jar (none) The location of the Spark jar file, in case
>>>> overriding the default location is desired. By default, Spark on YARN will
>>>> use a Spark jar installed locally, but the Spark jar can also be in a
>>>> world-readable location on HDFS. This allows YARN to cache it on nodes so
>>>> that it doesn't need to be distributed each time an application runs. To
>>>> point to a jar on HDFS, for example, set this configuration to
>>>> hdfs:///some/path.
>>>>
>>>> spark.yarn.jar is used for spark 

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
 exited with
>  exitCode: 1
> For more detailed output, check application tracking page:
> http://master:8088/proxy/application_1466568126079_0006/Then, click on
> links to logs of each attempt.
> Diagnostics: Exception from container-launch.
> Container id: container_1466568126079_0006_02_01
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Container exited with a non-zero exit code 1
> Failing this attempt. Failing the application.
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: default
> start time: 1466577373576
> final status: FAILED
> tracking URL:
> http://master:8088/cluster/app/application_1466568126079_0006
> user: shihj
> 16/06/22 14:36:27 INFO Client: Deleting staging directory
> .sparkStaging/application_1466568126079_0006
> Exception in thread "main" org.apache.spark.SparkException: Application
> application_1466568126079_0006 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 16/06/22 14:36:27 INFO ShutdownHookManager: Shutdown hook called
> 16/06/22 14:36:27 INFO ShutdownHookManager: Deleting directory
> /tmp/spark-cf23c5a3-d3fb-4f98-9cd2-bbf268766bbc
>
>
>
> -- Original Message --
> *From:* "Yash Sharma";<yash...@gmail.com>;
> *Sent:* Wednesday, 22 June 2016, 2:34 PM
> *To:* "另一片天"<958943...@qq.com>;
> *Cc:* "Saisai Shao"<sai.sai.s...@gmail.com>; "user"<user@spark.apache.org>;
>
> *Subject:* Re: Could not find or load main class
> org.apache.spark.deploy.yarn.ExecutorLauncher
>
> Try with : --master yarn-cluster
>
> On Wed, Jun 22, 2016 at 4:30 PM, 另一片天 <958943...@qq.com> wrote:
>
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>> --executor-cores 2
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>> 10
>> Warning: Skip remote jar
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar.
>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>> -- Original Message --
>> *From:* "Yash Sharm

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Try with: --master yarn-cluster

On Wed, Jun 22, 2016 at 4:30 PM, 另一片天 <958943...@qq.com> wrote:

> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
> --executor-cores 2
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
> 10
> Warning: Skip remote jar
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar.
> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> -- Original Message --
> *From:* "Yash Sharma";<yash...@gmail.com>;
> *Sent:* Wednesday, June 22, 2016, 2:28 PM
> *To:* "另一片天"<958943...@qq.com>;
> *Cc:* "Saisai Shao"<sai.sai.s...@gmail.com>; "user"<user@spark.apache.org>;
>
> *Subject:* Re: Could not find or load main class
> org.apache.spark.deploy.yarn.ExecutorLauncher
>
> Or better , try the master as yarn-cluster,
>
> ./bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master yarn-cluster \
> --driver-memory 512m \
> --num-executors 2 \
> --executor-memory 512m \
> --executor-cores 2 \
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.
> 1-hadoop2.6.0.jar
>
> On Wed, Jun 22, 2016 at 4:27 PM, 另一片天 <958943...@qq.com> wrote:
>
>> Is it able to run on local mode ?
>>
>> what mean?? standalone mode ?
>>
>>
>> -- Original Message --
>> *From:* "Yash Sharma";<yash...@gmail.com>;
>> *Sent:* Wednesday, June 22, 2016, 2:18 PM
>> *To:* "Saisai Shao"<sai.sai.s...@gmail.com>;
>> *Cc:* "另一片天"<958943...@qq.com>; "user"<user@spark.apache.org>;
>> *Subject:* Re: Could not find or load main class
>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>
>> Try providing the jar with the hdfs prefix. Its probably just because its
>> not able to find the jar on all nodes.
>>
>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.
>> 1-hadoop2.6.0.jar
>>
>> Is it able to run on local mode ?
>>
>> On Wed, Jun 22, 2016 at 4:14 PM, Saisai Shao <sai.sai.s...@gmail.com>
>> wrote:
>>
>>> spark.yarn.jar (none) The location of the Spark jar file, in case
>>> overriding the default location is desired. By default, Spark on YARN will
>>> use a Spark jar installed locally, but the Spark jar can also be in a
>>> world-readable location on HDFS. This allows YARN to cache it on nodes so
>>> that it doesn't need to be distributed each time an application runs. To
>>> point to a jar on HDFS, for example, set this configuration to
>>> hdfs:///some/path.
>>>
>>> spark.yarn.jar is used for spark run-time system jar, which is spark
>>> assembly jar, not the application jar (example-assembly jar). So in your
>>> case you upload the example-assembly jar into hdfs, in which spark system
>>> jars are not packed, so ExecutorLaucher cannot be found.
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Wed, Jun 22, 2016 at 2:10 PM, 另一片天 <958943...@qq.com> wrote:
>>>
>>>> shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$
>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>>>> --executor-cores 2
>>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
>>>> Warning: Local jar
>>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar does not exist,
>>>> skipping.
>>>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>> at java.lang.C

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Or better, try the master as yarn-cluster,

./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--driver-memory 512m \
--num-executors 2 \
--executor-memory 512m \
--executor-cores 2 \
hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar

On Wed, Jun 22, 2016 at 4:27 PM, 另一片天 <958943...@qq.com> wrote:

> Is it able to run on local mode ?
>
> what mean?? standalone mode ?
>
>
> -- Original Message --
> *From:* "Yash Sharma";<yash...@gmail.com>;
> *Sent:* Wednesday, June 22, 2016, 2:18 PM
> *To:* "Saisai Shao"<sai.sai.s...@gmail.com>;
> *Cc:* "另一片天"<958943...@qq.com>; "user"<user@spark.apache.org>;
> *Subject:* Re: Could not find or load main class
> org.apache.spark.deploy.yarn.ExecutorLauncher
>
> Try providing the jar with the hdfs prefix. Its probably just because its
> not able to find the jar on all nodes.
>
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.
> 1-hadoop2.6.0.jar
>
> Is it able to run on local mode ?
>
> On Wed, Jun 22, 2016 at 4:14 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
>> spark.yarn.jar (none) The location of the Spark jar file, in case
>> overriding the default location is desired. By default, Spark on YARN will
>> use a Spark jar installed locally, but the Spark jar can also be in a
>> world-readable location on HDFS. This allows YARN to cache it on nodes so
>> that it doesn't need to be distributed each time an application runs. To
>> point to a jar on HDFS, for example, set this configuration to
>> hdfs:///some/path.
>>
>> spark.yarn.jar is used for spark run-time system jar, which is spark
>> assembly jar, not the application jar (example-assembly jar). So in your
>> case you upload the example-assembly jar into hdfs, in which spark system
>> jars are not packed, so ExecutorLaucher cannot be found.
>>
>> Thanks
>> Saisai
>>
>> On Wed, Jun 22, 2016 at 2:10 PM, 另一片天 <958943...@qq.com> wrote:
>>
>>> shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$
>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>>> --executor-cores 2
>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
>>> Warning: Local jar
>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar does not exist,
>>> skipping.
>>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:348)
>>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> get error at once
>>> -- Original Message --
>>> *From:* "Yash Sharma";<yash...@gmail.com>;
>>> *Sent:* Wednesday, June 22, 2016, 2:04 PM
>>> *To:* "另一片天"<958943...@qq.com>;
>>> *Cc:* "user"<user@spark.apache.org>;
>>> *Subject:* Re: Could not find or load main class
>>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>>
>>> How about supplying the jar directly in spark submit -
>>>
>>> ./bin/spark-submit \
>>>> --class org.apache.spark.examples.SparkPi \
>>>> --master yarn-client \
>>>> --driver-memory 512m \
>>>> --num-executors 2 \
>>>> --executor-memory 512m \
>>>> --executor-cores 2 \
>>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>>
>>>
>>> On Wed, Jun 22, 2016 at 3:59 PM, 另一片天 <958943...@qq.com> wrote:
>>>
>>>> i  config this  para  at spark-defaults.conf
>>>> spark.yarn.jar
>>>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>>>
>>>> then ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>> --master yarn-client --driver-memory 512m --num-executors 2
>>>> --executor-memory 512m --executor-cores 210:
>>>>
>>>>
>>>>
>>>>- Error: Could not find or load main class
>>>>org.apache.spark.deploy.yarn.ExecutorLauncher
>>>>
>>>> but  i don't config that para ,there no error  why???that para is only
>>>> avoid Uploading resource file(jar package)??
>>>>
>>>
>>>
>>
>


Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
I meant try having the full path in the spark-submit command -

./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-client \
--driver-memory 512m \
--num-executors 2 \
--executor-memory 512m \
--executor-cores 2 \
hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar



On Wed, Jun 22, 2016 at 4:23 PM, 另一片天 <958943...@qq.com> wrote:

> shihj@master:~/workspace/hadoop-2.6.4$ bin/hadoop fs -ls
> hdfs://master:9000/user/shihj/spark_lib
> Found 1 items
> -rw-r--r--   3 shihj supergroup  118955968 2016-06-22 10:24
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
> shihj@master:~/workspace/hadoop-2.6.4$
> can find the jar on all nodes.
>
>
> -- Original Message --
> *From:* "Yash Sharma";<yash...@gmail.com>;
> *Sent:* Wednesday, June 22, 2016, 2:18 PM
> *To:* "Saisai Shao"<sai.sai.s...@gmail.com>;
> *Cc:* "另一片天"<958943...@qq.com>; "user"<user@spark.apache.org>;
> *Subject:* Re: Could not find or load main class
> org.apache.spark.deploy.yarn.ExecutorLauncher
>
> Try providing the jar with the hdfs prefix. Its probably just because its
> not able to find the jar on all nodes.
>
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.
> 1-hadoop2.6.0.jar
>
> Is it able to run on local mode ?
>
> On Wed, Jun 22, 2016 at 4:14 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
>> spark.yarn.jar (none) The location of the Spark jar file, in case
>> overriding the default location is desired. By default, Spark on YARN will
>> use a Spark jar installed locally, but the Spark jar can also be in a
>> world-readable location on HDFS. This allows YARN to cache it on nodes so
>> that it doesn't need to be distributed each time an application runs. To
>> point to a jar on HDFS, for example, set this configuration to
>> hdfs:///some/path.
>>
>> spark.yarn.jar is used for spark run-time system jar, which is spark
>> assembly jar, not the application jar (example-assembly jar). So in your
>> case you upload the example-assembly jar into hdfs, in which spark system
>> jars are not packed, so ExecutorLaucher cannot be found.
>>
>> Thanks
>> Saisai
>>
>> On Wed, Jun 22, 2016 at 2:10 PM, 另一片天 <958943...@qq.com> wrote:
>>
>>> shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$
>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>>> --executor-cores 2
>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
>>> Warning: Local jar
>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar does not exist,
>>> skipping.
>>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:348)
>>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> get error at once
>>> -- Original Message --
>>> *From:* "Yash Sharma";<yash...@gmail.com>;
>>> *Sent:* Wednesday, June 22, 2016, 2:04 PM
>>> *To:* "另一片天"<958943...@qq.com>;
>>> *Cc:* "user"<user@spark.apache.org>;
>>> *Subject:* Re: Could not find or load main class
>>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>>
>>> How about supplying the jar directly in spark submit -
>>>
>>> ./bin/spark-submit \
>>>> --class org.apache.spark.examples.SparkPi \
>>>> --master yarn-client \
>>>> --driver-memory 512m \
>>>> --num-executors 2 \
>>>> --executor-memory 512m \
>>>> --executor-cores 2 \
>>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>>
>>>
>>> On Wed, Jun 22, 2016 at 3:59 PM, 另一片天 <958943...@qq.com> wrote:
>>>
>>>> i  config this  para  at spark-defaults.conf
>>>> spark.yarn.jar
>>>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>>>
>>>> then ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>> --master yarn-client --driver-memory 512m --num-executors 2
>>>> --executor-memory 512m --executor-cores 210:
>>>>
>>>>
>>>>
>>>>- Error: Could not find or load main class
>>>>org.apache.spark.deploy.yarn.ExecutorLauncher
>>>>
>>>> but  i don't config that para ,there no error  why???that para is only
>>>> avoid Uploading resource file(jar package)??
>>>>
>>>
>>>
>>
>


Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Try providing the jar with the hdfs prefix. It's probably just because it's
not able to find the jar on all nodes.

hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar

Is it able to run on local mode ?
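
To illustrate the spark.yarn.jar point quoted below from Saisai, a minimal sketch of keeping
the two jars separate (the assembly jar name and its HDFS location are assumptions; check what
your Spark build actually ships under lib/):

# spark-defaults.conf: spark.yarn.jar should point at the Spark assembly jar, not the application jar
spark.yarn.jar  hdfs://master:9000/user/shihj/spark_lib/spark-assembly-1.6.1-hadoop2.6.0.jar

# the examples (application) jar is then passed to spark-submit as before
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-client \
  --driver-memory 512m \
  --num-executors 2 \
  --executor-memory 512m \
  --executor-cores 2 \
  hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar \
  10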

On Wed, Jun 22, 2016 at 4:14 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> spark.yarn.jar (none) The location of the Spark jar file, in case
> overriding the default location is desired. By default, Spark on YARN will
> use a Spark jar installed locally, but the Spark jar can also be in a
> world-readable location on HDFS. This allows YARN to cache it on nodes so
> that it doesn't need to be distributed each time an application runs. To
> point to a jar on HDFS, for example, set this configuration to
> hdfs:///some/path.
>
> spark.yarn.jar is used for spark run-time system jar, which is spark
> assembly jar, not the application jar (example-assembly jar). So in your
> case you upload the example-assembly jar into hdfs, in which spark system
> jars are not packed, so ExecutorLaucher cannot be found.
>
> Thanks
> Saisai
>
> On Wed, Jun 22, 2016 at 2:10 PM, 另一片天 <958943...@qq.com> wrote:
>
>> shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
>> --executor-cores 2
>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
>> Warning: Local jar
>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar does not exist,
>> skipping.
>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> get error at once
>> -- Original Message --
>> *From:* "Yash Sharma";<yash...@gmail.com>;
>> *Sent:* Wednesday, June 22, 2016, 2:04 PM
>> *To:* "另一片天"<958943...@qq.com>;
>> *Cc:* "user"<user@spark.apache.org>;
>> *Subject:* Re: Could not find or load main class
>> org.apache.spark.deploy.yarn.ExecutorLauncher
>>
>> How about supplying the jar directly in spark submit -
>>
>> ./bin/spark-submit \
>>> --class org.apache.spark.examples.SparkPi \
>>> --master yarn-client \
>>> --driver-memory 512m \
>>> --num-executors 2 \
>>> --executor-memory 512m \
>>> --executor-cores 2 \
>>> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>
>>
>> On Wed, Jun 22, 2016 at 3:59 PM, 另一片天 <958943...@qq.com> wrote:
>>
>>> i  config this  para  at spark-defaults.conf
>>> spark.yarn.jar
>>> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>>>
>>> then ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>> --master yarn-client --driver-memory 512m --num-executors 2
>>> --executor-memory 512m --executor-cores 210:
>>>
>>>
>>>
>>>- Error: Could not find or load main class
>>>org.apache.spark.deploy.yarn.ExecutorLauncher
>>>
>>> but  i don't config that para ,there no error  why???that para is only
>>> avoid Uploading resource file(jar package)??
>>>
>>
>>
>


Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
How about supplying the jar directly in spark-submit -

./bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master yarn-client \
> --driver-memory 512m \
> --num-executors 2 \
> --executor-memory 512m \
> --executor-cores 2 \
> /user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar


On Wed, Jun 22, 2016 at 3:59 PM, 另一片天 <958943...@qq.com> wrote:

> i  config this  para  at spark-defaults.conf
> spark.yarn.jar
> hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar
>
> then ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m
> --executor-cores 210:
>
>
>
>- Error: Could not find or load main class
>org.apache.spark.deploy.yarn.ExecutorLauncher
>
> but  i don't config that para ,there no error  why???that para is only
> avoid Uploading resource file(jar package)??
>


Re: Python to Scala

2016-06-18 Thread Yash Sharma
A couple of things that can work:
- If you know the logic, just forget the Python script and write it in
Java/Scala from scratch.
- If the script relies on Python functions and libraries, PySpark is probably
the best bet.
- If you have a specific question on how to solve a particular
implementation issue you are facing, ask it here :)

Apart from that, it's really difficult to understand what you are asking.

- Thanks, via mobile,  excuse brevity.
On Jun 18, 2016 3:27 PM, "Aakash Basu" <raj2coo...@gmail.com> wrote:

> I don't have a sound knowledge in Python and on the other hand we are
> working on Spark on Scala, so I don't think it will be allowed to run
> PySpark along with it, so the requirement is to convert the code to scala
> and use it. But I'm finding it difficult.
>
> Did not find a better forum for help than ours. Hence this mail.
> On 18-Jun-2016 10:39 AM, "Stephen Boesch" <java...@gmail.com> wrote:
>
>> What are you expecting us to do?  Yash provided a reasonable approach -
>> based on the info you had provided in prior emails.  Otherwise you can
>> convert it from python to spark - or find someone else who feels
>> comfortable to do it.  That kind of inquiry would likelybe appropriate on a
>> job board.
>>
>>
>>
>> 2016-06-17 21:47 GMT-07:00 Aakash Basu <raj2coo...@gmail.com>:
>>
>>> Hey,
>>>
>>> Our complete project is in Spark on Scala, I code in Scala for Spark,
>>> though am new, but I know it and still learning. But I need help in
>>> converting this code to Scala. I've nearly no knowledge in Python, hence,
>>> requested the experts here.
>>>
>>> Hope you get me now.
>>>
>>> Thanks,
>>> Aakash.
>>> On 18-Jun-2016 10:07 AM, "Yash Sharma" <yash...@gmail.com> wrote:
>>>
>>>> You could use pyspark to run the python code on spark directly. That
>>>> will cut the effort of learning scala.
>>>>
>>>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>>>>
>>>> - Thanks, via mobile,  excuse brevity.
>>>> On Jun 18, 2016 2:34 PM, "Aakash Basu" <raj2coo...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've a python code, which I want to convert to Scala for using it in a
>>>>> Spark program. I'm not so well acquainted with python and learning scala
>>>>> now. Any Python+Scala expert here? Can someone help me out in this please?
>>>>>
>>>>> Thanks & Regards,
>>>>> Aakash.
>>>>>
>>>>
>>


Re: Python to Scala

2016-06-17 Thread Yash Sharma
You could use PySpark to run the Python code on Spark directly. That will
cut the effort of learning Scala.

https://spark.apache.org/docs/0.9.0/python-programming-guide.html
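
For example, an existing script can usually be submitted to the cluster as it is (the script
name and the master are illustrative):

# interactive shell
./bin/pyspark
# or submit an existing script unchanged
./bin/spark-submit --master yarn-client my_python_job.py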

- Thanks, via mobile,  excuse brevity.
On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:

> Hi all,
>
> I've a python code, which I want to convert to Scala for using it in a
> Spark program. I'm not so well acquainted with python and learning scala
> now. Any Python+Scala expert here? Can someone help me out in this please?
>
> Thanks & Regards,
> Aakash.
>


Re: StackOverflow in Spark

2016-06-01 Thread Yash Sharma
Not sure if it's related, but I got a similar stack overflow error some time
back while reading files and converting them to Parquet.



> Stack trace-
> 16/06/02 02:23:54 INFO YarnAllocator: Driver requested a total number of
> 32769 executor(s).
> 16/06/02 02:23:54 INFO ExecutorAllocationManager: Requesting 16384 new
> executors because tasks are backlogged (new desired total will be 32769)
> 16/06/02 02:23:54 INFO YarnAllocator: Will request 24576 executor
> containers, each with 5 cores and 22528 MB memory including 2048 MB overhead
> 16/06/02 02:23:55 WARN ApplicationMaster: Reporter thread fails 2 time(s)
> in a row.
> at
> scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
> at
> scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
> at
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
> at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

java.lang.StackOverflowError
>


On Thu, Jun 2, 2016 at 12:58 PM, Ted Yu  wrote:

> Looking at Michel's stack trace, it seems to be different issue.
>
> On Jun 1, 2016, at 7:45 PM, Matthew Young  wrote:
>
> Hi,
>
> It's related to the one fixed bug in Spark, jira ticket SPARK-6847
> 
>
> Matthew Yang
>
> On Wed, May 25, 2016 at 7:48 PM, Michel Hubert  wrote:
>
>>
>>
>> Hi,
>>
>>
>>
>>
>>
>> I have an Spark application which generates StackOverflowError exceptions
>> after 30+ min.
>>
>>
>>
>> Anyone any ideas?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 16/05/25 10:48:51 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 55449.0 (TID 5584, host81440-cld.opentsp.com):
>> java.lang.StackOverflowError
>>
>> ·at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
>>
>> ·at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>
>> ·at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>
>> ·at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>
>> ·at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>
>> ·at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>
>> ·at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>
>> ·at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>
>> ·at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>
>> ·at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>
>> ·at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>
>> ·at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>
>> ·at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> ·at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> ·at
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>
>> ·at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>
>> ·at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>
>> ·at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>
>> ·at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>
>> ·at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>
>> ·at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>
>> ·at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>
>> ·at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>
>> ·at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>
>> ·at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>
>> ·at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>
>> ·at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>
>> ·at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>
>> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown
>> Source)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Driver stacktrace:
>>
>> ·at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>>
>> ·at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>>
>> ·at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>>
>> ·at
>> 

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-25 Thread Yash Sharma
For specific offsets you can directly pass the offset ranges and use
KafkaUtils.createRDD to get the events that were missed in the DStream.
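
A minimal sketch in Scala, assuming the Spark 1.x spark-streaming-kafka artifact; the broker
address and offset values are illustrative, and the topic name is taken from your snippet:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// name the exact slices that were missed: (topic, partition, fromOffset, untilOffset)
val offsetRanges = Array(OffsetRange("sample_sample3_json", 0, 0L, 1000L))

// returns an RDD[(key, value)] covering exactly those offsets
val missed = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

missed.map(_._2).take(10).foreach(println)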

- Thanks, via mobile,  excuse brevity.
On Jan 25, 2016 3:33 PM, "Raju Bairishetti" <r...@apache.org> wrote:

> Hi Yash,
>Basically, my question is how to avoid storing the kafka offsets in
> spark checkpoint directory. Streaming context is getting build from
> checkpoint directory and proceeding with the offsets in checkpointed RDD.
>
> I want to consume data from kafka from specific offsets along with the
> spark checkpoints. Streaming context is getting prepared from the
> checkpoint directory and started consuming from the topic offsets which
> were stored in checkpoint directory.
>
>
> On Sat, Jan 23, 2016 at 3:44 PM, Yash Sharma <yash...@gmail.com> wrote:
>
>> Hi Raju,
>> Could you please explain your expected behavior with the DStream. The
>> DStream will have event only from the 'fromOffsets' that you provided in
>> the createDirectStream (which I think is the expected behavior).
>>
>> For the smaller files, you will have to deal with smaller files if you
>> intend to write it immediately. Alternately what we do sometimes is-
>>
>> 1.  Maintain couple of iterations for some 30-40 seconds in application
>> until we have substantial data and then we write them to disk.
>> 2. Push smaller data back to kafka, and a different job handles the save
>> to disk.
>>
>> On Sat, Jan 23, 2016 at 7:01 PM, Raju Bairishetti <r...@apache.org>
>> wrote:
>>
>>> Thanks for quick reply.
>>> I am creating Kafka Dstream by passing offsets map. I have pasted code
>>> snippet in my earlier mail. Let me know am I missing something.
>>>
>>> I want to use spark checkpoint for hand ng only driver/executor
>>> failures.
>>> On Jan 22, 2016 10:08 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
>>>
>>>> Offsets are stored in the checkpoint.  If you want to manage offsets
>>>> yourself, don't restart from the checkpoint, specify the starting offsets
>>>> when you create the stream.
>>>>
>>>> Have you read / watched the materials linked from
>>>>
>>>> https://github.com/koeninger/kafka-exactly-once
>>>>
>>>> Regarding the small files problem, either don't use HDFS, or use
>>>> something like filecrush for merging.
>>>>
>>>> On Fri, Jan 22, 2016 at 3:03 AM, Raju Bairishetti <r...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>I am very new to spark & spark-streaming. I am planning to use
>>>>> spark streaming for real time processing.
>>>>>
>>>>>I have created a streaming context and checkpointing to hdfs
>>>>> directory for recovery purposes in case of executor failures & driver
>>>>> failures.
>>>>>
>>>>> I am creating Dstream with offset map for getting the data from kafka.
>>>>> I am simply ignoring the offsets to understand the behavior. Whenver I
>>>>> restart application driver restored from checkpoint as expected but 
>>>>> Dstream
>>>>> is not getting started from the initial offsets. Dstream was created with
>>>>> the last consumed offsets instead of startign from 0 offsets for each 
>>>>> topic
>>>>> partition as I am not storing the offsets any where.
>>>>>
>>>>> def main : Unit = {
>>>>>
>>>>> var sparkStreamingContext = 
>>>>> StreamingContext.getOrCreate(SparkConstants.CHECKPOINT_DIR_LOCATION,
>>>>>   () => creatingFunc())
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>>> }
>>>>>
>>>>> def creatingFunc(): Unit = {
>>>>>
>>>>> ...
>>>>>
>>>>> var offsets:Map[TopicAndPartition, Long] = 
>>>>> Map(TopicAndPartition("sample_sample3_json",0) -> 0)
>>>>>
>>>>> KafkaUtils.createDirectStream[String,String, StringDecoder, 
>>>>> StringDecoder,
>>>>> String](sparkStreamingContext, kafkaParams, offsets, messageHandler)
>>>>>
>>>>> ...
>>>>> }
>>>>>
>>>>> I want to get control over offset management at event level instead of
>>>>> RDD level to make sure that at least once delivery to end system.
>>>>>
>>>>> As per my understanding, every RDD or RDD partition will stored in
>>>>> hdfs as a file If I choose to use HDFS as output. If I use 1sec as batch
>>>>> interval then it will be ended up having huge number of small files in
>>>>> HDFS. Having small files in HDFS will leads to lots of other issues.
>>>>> Is there any way to write multiple RDDs into single file? Don't have
>>>>> muh idea about *coalesce* usage. In the worst case, I can merge all small
>>>>> files in HDFS in regular intervals.
>>>>>
>>>>> Thanks...
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Raju Bairishetti
>>>>> www.lazada.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>
>
> --
>
> --
> Thanks
> Raju Bairishetti
> www.lazada.com
>


Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-23 Thread Yash Sharma
Hi Raju,
Could you please explain your expected behavior with the DStream? The
DStream will have events only from the 'fromOffsets' that you provided in
createDirectStream (which I think is the expected behavior).

For the smaller files: you will have to live with small files if you
intend to write each batch immediately. Alternatively, what we do sometimes is:

1. Maintain a couple of iterations for some 30-40 seconds in the application
until we have substantial data, and then write it to disk (a rough sketch of
this follows below).
2. Push the smaller data back to Kafka, and let a different job handle the
save to disk.
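
A rough sketch of the first approach, using a windowed DStream so that roughly 40 seconds of
1-second batches are written at once (the source, path and durations are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))

// stand-in source; in your case this would be the values of the Kafka direct stream
val events = ssc.socketTextStream("localhost", 9999)

// accumulate ~40 seconds of data before touching HDFS, and shrink the partition
// count so each write produces a few larger files instead of many tiny ones
events.window(Seconds(40), Seconds(40)).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) rdd.coalesce(4).saveAsTextFile(s"/tmp/events/${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()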

On Sat, Jan 23, 2016 at 7:01 PM, Raju Bairishetti  wrote:

> Thanks for quick reply.
> I am creating Kafka Dstream by passing offsets map. I have pasted code
> snippet in my earlier mail. Let me know am I missing something.
>
> I want to use spark checkpoint for hand ng only driver/executor failures.
> On Jan 22, 2016 10:08 PM, "Cody Koeninger"  wrote:
>
>> Offsets are stored in the checkpoint.  If you want to manage offsets
>> yourself, don't restart from the checkpoint, specify the starting offsets
>> when you create the stream.
>>
>> Have you read / watched the materials linked from
>>
>> https://github.com/koeninger/kafka-exactly-once
>>
>> Regarding the small files problem, either don't use HDFS, or use
>> something like filecrush for merging.
>>
>> On Fri, Jan 22, 2016 at 3:03 AM, Raju Bairishetti 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>I am very new to spark & spark-streaming. I am planning to use spark
>>> streaming for real time processing.
>>>
>>>I have created a streaming context and checkpointing to hdfs
>>> directory for recovery purposes in case of executor failures & driver
>>> failures.
>>>
>>> I am creating Dstream with offset map for getting the data from kafka. I
>>> am simply ignoring the offsets to understand the behavior. Whenver I
>>> restart application driver restored from checkpoint as expected but Dstream
>>> is not getting started from the initial offsets. Dstream was created with
>>> the last consumed offsets instead of startign from 0 offsets for each topic
>>> partition as I am not storing the offsets any where.
>>>
>>> def main : Unit = {
>>>
>>> var sparkStreamingContext = 
>>> StreamingContext.getOrCreate(SparkConstants.CHECKPOINT_DIR_LOCATION,
>>>   () => creatingFunc())
>>>
>>> ...
>>>
>>>
>>> }
>>>
>>> def creatingFunc(): Unit = {
>>>
>>> ...
>>>
>>> var offsets:Map[TopicAndPartition, Long] = 
>>> Map(TopicAndPartition("sample_sample3_json",0) -> 0)
>>>
>>> KafkaUtils.createDirectStream[String,String, StringDecoder, 
>>> StringDecoder,
>>> String](sparkStreamingContext, kafkaParams, offsets, messageHandler)
>>>
>>> ...
>>> }
>>>
>>> I want to get control over offset management at event level instead of
>>> RDD level to make sure that at least once delivery to end system.
>>>
>>> As per my understanding, every RDD or RDD partition will stored in hdfs
>>> as a file If I choose to use HDFS as output. If I use 1sec as batch
>>> interval then it will be ended up having huge number of small files in
>>> HDFS. Having small files in HDFS will leads to lots of other issues.
>>> Is there any way to write multiple RDDs into single file? Don't have muh
>>> idea about *coalesce* usage. In the worst case, I can merge all small files
>>> in HDFS in regular intervals.
>>>
>>> Thanks...
>>>
>>> --
>>> Thanks
>>> Raju Bairishetti
>>> www.lazada.com
>>>
>>>
>>>
>>>
>>


Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Hi Jan,
Is the error because a past run of the job has already written to the
location?

In that case you can add more granularity with 'time' along with year and
month. That should give you a distinct path for every run.

Let us know if it helps or if I missed anything.

Goodluck

- Thanks, via mobile,  excuse brevity.
On Dec 22, 2015 2:31 PM, "Jan Holmberg"  wrote:

> Hi,
> I'm stuck with writing partitioned data to hdfs. Example below ends up
> with 'already exists' -error.
>
> I'm wondering how to handle streaming use case.
>
> What is the intended way to write streaming data to hdfs? What am I
> missing?
>
> cheers,
> -jan
>
>
> import com.databricks.spark.avro._
>
> import org.apache.spark.sql.SQLContext
>
> val sqlContext = new SQLContext(sc)
>
> import sqlContext.implicits._
>
> val df = Seq(
> (2012, 8, "Batman", 9.8),
> (2012, 8, "Hero", 8.7),
> (2012, 7, "Robot", 5.5),
> (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
>
> df.write.partitionBy("year", "month").avro("/tmp/data")
>
> val df2 = Seq(
> (2012, 10, "Batman", 9.8),
> (2012, 10, "Hero", 8.7),
> (2012, 9, "Robot", 5.5),
> (2011, 9, "Git", 2.0)).toDF("year", "month", "title", "rating")
>
> df2.write.partitionBy("year", "month").avro("/tmp/data")
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Apache spark certification pass percentage ?

2015-12-22 Thread Yash Sharma
Hi Sri,
That would depend on the organization through which you are taking the
certification.

This list is more helpful for questions about using Spark and/or
contributing to Spark.

Goodluck

- Thanks, via mobile,  excuse brevity.
On Dec 22, 2015 3:56 PM, "kali.tumm...@gmail.com" 
wrote:

> Hi All,
>
> Does anyone know pass percentage for Apache spark certification exam ?
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-spark-certification-pass-percentage-tp25761.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Client session timed out, have not heard from server in

2015-12-22 Thread Yash Sharma
Hi Evan,
SPARK-9629 referred to connection issues with ZooKeeper. Could you check
if it's working fine in your setup?

Also please share other error logs you might be getting.

- Thanks, via mobile,  excuse brevity.
On Dec 22, 2015 5:00 PM, "yaoxiaohua"  wrote:

> Hi,
>
> I encounter a similar question, spark1.4
>
> Master2 run some days , then give a timeout exception, then shutdown.
>
> I found a bug :
>
> https://issues.apache.org/jira/browse/SPARK-9629
>
>
>
>
> INFO ClientCnxn: Client session timed out, have not heard from server in
> 40015ms for sessionid 0x351c416297a145a, closing socket connection and
> attempting reconnect
>
>
>
>
>
> could you tell me what do you do for this?
>
>
>
> Best Regards,
>
> Evan
>


Re: Client session timed out, have not heard from server in

2015-12-22 Thread Yash Sharma
Evan, could you also share more logs on the error? Please paste them here or
in a pastebin.

Also check the ZooKeeper logs in case you find anything.

- Thanks, via mobile,  excuse brevity.
On Dec 22, 2015 6:01 PM, "Dirceu Semighini Filho" <
dirceu.semigh...@gmail.com> wrote:

> Hi Yash,
> I've experienced this behavior here when the process freeze in a worker.
> This mainly happen, in my case, when the worker memory was full and the
> java GC wasn't able to free memory for the process.
> Try to search for outofmemory error in your worker logs.
>
> Regards,
> Dirceu
>
> 2015-12-22 10:26 GMT-02:00 yaoxiaohua <yaoxiao...@outlook.com>:
>
>> Thanks for your reply.
>>
>> I find spark-env.sh :
>>
>> SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.askTimeout=300
>> -Dspark.ui.retainedStages=1000 -Dspark.eventLog.enabled=true
>> -Dspark.eventLog.dir=hdfs://sparkcluster/user/spark_history_logs
>> -Dspark.shuffle.spill=false -Dspark.shuffle.manager=hash
>> -Dspark.yarn.max.executor.failures=9 -Dspark.worker.timeout=300"
>>
>>
>>
>> I just find log like this:
>>
>>
>> INFO ClientCnxn: Client session timed out, have not heard from server in
>> 40015ms for sessionid 0x351c416297a145a, closing socket connection and
>> attempting reconnect
>>
>> Before spark2 master process shut down.
>>
>> I don’t see any zookeeper timeout setting .
>>
>>
>>
>> Best
>>
>>
>>
>> *From:* Yash Sharma [mailto:yash...@gmail.com]
>> *Sent:* 2015年12月22日 19:55
>> *To:* yaoxiaohua
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Client session timed out, have not heard from server in
>>
>>
>>
>> Hi Evan,
>> SPARK-9629 referred to connection issues with zookeeper.  Could you check
>> if its working fine in your setup.
>>
>> Also please share other error logs you might be getting.
>>
>> - Thanks, via mobile,  excuse brevity.
>>
>> On Dec 22, 2015 5:00 PM, "yaoxiaohua" <yaoxiao...@outlook.com> wrote:
>>
>> Hi,
>>
>> I encounter a similar question, spark1.4
>>
>> Master2 run some days , then give a timeout exception, then shutdown.
>>
>> I found a bug :
>>
>> https://issues.apache.org/jira/browse/SPARK-9629
>>
>>
>>
>>
>> INFO ClientCnxn: Client session timed out, have not heard from server in
>> 40015ms for sessionid 0x351c416297a145a, closing socket connection and
>> attempting reconnect
>>
>>
>>
>>
>>
>> could you tell me what do you do for this?
>>
>>
>>
>> Best Regards,
>>
>> Evan
>>
>
>


Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Well, this will indeed hit the error if the next run has the same year and
month values, and writing would not be possible.

You can try working around it by introducing a runCount in the partitioning
or in the output path.

Something like-

/tmp/data/year/month/01
/tmp/data/year/month/02

Or,
/tmp/data/01/year/month
/tmp/data/02/year/month

This is a workaround.
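
A minimal sketch of the second variant, putting a per-run marker in the output path (the
run id here is just a timestamp; any value that is distinct per run works):

import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// anything unique per run works: a counter you manage yourself, or simply a timestamp
val runId = System.currentTimeMillis

val df2 = Seq(
  (2012, 10, "Batman", 9.8),
  (2011, 9, "Git", 2.0)).toDF("year", "month", "title", "rating")

// each run writes under its own base directory, so the 'path already exists' check never trips
df2.write.partitionBy("year", "month").avro(s"/tmp/data/$runId")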

I am sure better approaches will follow.

- Thanks, via mobile,  excuse brevity.
On Dec 22, 2015 7:01 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:

> Hi Yash,
>
> the error is caused by the fact that first run creates the base directory
> ie. "/tmp/data" and the second batch stumbles to the existing base
> directory. I understand that the existing base directory is a challenge but
> I do not understand how to make this work with streaming example where each
> batch would create a new distinct directory.
>
> Granularity has no impact. No matter how data is partitioned, second
> 'batch' always fails with existing base dir.
>
> scala> df2.write.partitionBy("year").avro("/tmp/data")
> org.apache.spark.sql.AnalysisException: path hdfs://nameservice1/tmp/data
> already exists.;
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at
> com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:37)
> at
> com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:37)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>
>
> On 22 Dec 2015, at 14:06, Yash Sharma <yash...@gmail.com> wrote:
>
> Hi Jan,
> Is the error because a past run of the job has already written to the
> location?
>
> In that case you can add more granularity with 'time' along with year and
> month. That should give you a distinct path for every run.
>
> Let us know if it helps or if i missed anything.
>
> Goodluck
>
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 2:31 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:
>
>> Hi,
>> I'm stuck with writing partitioned data to hdfs. Example below ends up
>> with 'already exists' -error.
>>
>> I'm wondering how to handle streaming use case.
>>
>> What is the intended way to write streaming data to hdfs? What am I
>> missing?
>>
>> cheers,
>> -jan
>>
>>
>> import com.databricks.spark.avro._
>>
>> import org.apache.spark.sql.SQLContext
>>
>> val sqlContext = new SQLContext(sc)
>>
>> import sqlContext.implicits._
>>
>> val df = Seq(
>> (2012, 8, "Batman", 9.8),
>> (2012, 8, "Hero", 8.7),
>> (2012, 7, "Robot", 5.5),
>> (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
>>
>> df.write.partitionBy("year", "month").avro("/tmp/data")
>>
>> val df2 = Seq(
>> (2012, 10, "Batman", 9.8),
>> (2012, 10, "Hero", 8.7),
>> (2012, 9, "Robot", 5.5),
>> (2011, 9, "Git", 2.0)).toDF("year", "month", "title", "rating")
>>
>> df2.write.partitionBy("year", "month").avro("/tmp/data")
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Well, you are right. Having a quick glance at the source [1], I see that the
path creation does not consider the partitions.

It tries to create the path before looking for the partition columns.

Not sure what the best way to incorporate this would be. You could probably
file a JIRA so that experienced contributors can share their thoughts.

1.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ResolvedDataSource.scala

Line 131

- Thanks, via mobile,  excuse brevity.
On Dec 22, 2015 7:48 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:

> In my example directories were distinct.
>
> So If I would like to have to distinct directories ex.
>
> /tmp/data/year=2012
> /tmp/data/year=2013
>
> It does not work with
> val df = Seq((2012, "Batman")).toDF("year","title")
>
> df.write.partitionBy("year").avro("/tmp/data")
>
> val df2 = Seq((2013, "Batman")).toDF("year","title")
>
> df2.write.partitionBy("year").avro("/tmp/data")
>
>
> As you can see, it complains about the target directory (/tmp/data) and
> not about the partitioning keys.
>
>
> org.apache.spark.sql.AnalysisException: path hdfs://nameservice1/tmp/data
> already exists.;
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>
>
>
> On 22 Dec 2015, at 15:44, Yash Sharma <yash...@gmail.com> wrote:
>
> Well this will indeed hit the error if the next run has similar year and
> months and writing would not be possible.
>
> You can try working around by introducing a runCount in partition or in
> the output path.
>
> Something like-
>
> /tmp/data/year/month/01
> /tmp/data/year/month/02
>
> Or,
> /tmp/data/01/year/month
> /tmp/data/02/year/month
>
> This is a work around.
>
> Am sure other better approaches would follow.
>
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 7:01 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:
>
>> Hi Yash,
>>
>> the error is caused by the fact that first run creates the base directory
>> ie. "/tmp/data" and the second batch stumbles to the existing base
>> directory. I understand that the existing base directory is a challenge but
>> I do not understand how to make this work with streaming example where each
>> batch would create a new distinct directory.
>>
>> Granularity has no impact. No matter how data is partitioned, second
>> 'batch' always fails with existing base dir.
>>
>> scala> df2.write.partitionBy("year").avro("/tmp/data")
>> org.apache.spark.sql.AnalysisException: path hdfs://nameservice1/tmp/data
>> already exists.;
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>> at
>> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>> at
>> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>> at
>> com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:37)
>> at
>> com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:37)
>> at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$