Re: Python + Spark unable to connect to S3 bucket .... Invalid hostname in URI

2014-08-15 Thread Miroslaw
So after doing some more research, I found the root cause of the problem: the
bucket name we were using contained an underscore '_', which goes against the
newer requirements for naming buckets. Using a bucket whose name does not
contain an underscore solved the issue.

If anyone else runs into this problem, I hope this will help them out. 
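
For illustration only (hypothetical bucket names), the difference looks like this:

  s3n://my_log_bucket/events/   -- underscore in the bucket name; fails with the Invalid hostname in URI error
  s3n://my-log-bucket/events/   -- hyphenated name follows the newer naming rules and works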



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-unable-to-connect-to-S3-bucket-Invalid-hostname-in-URI-tp12076p12169.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark on yarn cluster can't launch

2014-08-15 Thread centerqi hu
The following command does not run:

../bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --verbose \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  ../lib/spark-examples*.jar \
  100

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:109)
at org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:108)
at org.apache.spark.Logging$class.logInfo(Logging.scala:58)


However, when I remove --deploy-mode cluster, the exception disappears.

My understanding is that --deploy-mode cluster runs the job in yarn-cluster
mode; without it, the default is yarn-client mode.

But why does yarn-cluster mode throw this exception?


Thanks





--
cente...@gmail.com|齐忠


Re: Debugging Task not serializable

2014-08-15 Thread Juan Rodríguez Hortalá
Hi Sourav,

I will take a look at that too. Thanks a lot for your help.

Greetings,

Juan


2014-07-30 10:58 GMT+02:00 Sourav Chandra sourav.chan...@livestream.com:

 While running the application, set this JVM option:
 -Dsun.io.serialization.extendedDebugInfo=true
 This is applicable to Java versions after 1.6.
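
For reference, one way to pass that option to both the driver and the executors
is via conf/spark-defaults.conf; the property names below should be checked
against your Spark version's configuration docs, so treat this as a sketch:

  spark.driver.extraJavaOptions    -Dsun.io.serialization.extendedDebugInfo=true
  spark.executor.extraJavaOptions  -Dsun.io.serialization.extendedDebugInfo=true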


 On Wed, Jul 30, 2014 at 2:13 PM, Juan Rodríguez Hortalá 
 juan.rodriguez.hort...@gmail.com wrote:

 Akhil, Andy, thanks a lot for your suggestions. I will take a look at
 those JVM options.

 Greetings,

 Juan


 2014-07-28 18:56 GMT+02:00 andy petrella andy.petre...@gmail.com:

 Also check the guides for the JVM option that prints messages for such
 problems.
 Sorry, sent from phone and don't know it by heart :/
 Le 28 juil. 2014 18:44, Akhil Das ak...@sigmoidanalytics.com a
 écrit :

  A quick fix would be to implement java.io.Serializable in those
 classes which are causing this exception.



 Thanks
 Best Regards


 On Mon, Jul 28, 2014 at 9:21 PM, Juan Rodríguez Hortalá 
 juan.rodriguez.hort...@gmail.com wrote:

 Hi all,

 I was wondering if someone has come up with a method for debugging Task
 not serializable: java.io.NotSerializableException errors, apart from
 commenting and uncommenting parts of the program, or just making everything
 Serializable. I find this kind of error very hard to debug, as it originates
 in the Spark runtime system.

 I'm using Spark for Java.

 Thanks a lot in advance,

 Juan






 --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com



Re: spark streaming - lambda architecture

2014-08-15 Thread Sean Owen
You may be interested in https://github.com/OryxProject/oryx which is
at heart exactly lambda architecture on Spark Streaming. With ML
pipelines on top. The architecture diagram and a peek at the code may
give you a good example of how this could be implemented. I choose to
view the batch layer as just a long-period streaming job on Spark
Streaming, and implement the speed layer as a short-period streaming
job. Summingbird is a good example too although it uses Storm and
MapReduce, and is architected specifically for simple aggregations. I
am not sure it generalizes but you may not need anything complex.

On Thu, Aug 14, 2014 at 10:27 PM, salemi alireza.sal...@udo.edu wrote:
 Hi,

 How would you implement the batch layer of the lambda architecture with
 Spark / Spark Streaming?

 Thanks,
 Ali



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Cui xp
Did I not describe the problem clearly? Is anyone familiar with softmax
regression?
Thanks.
Cui xp.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark won't build with maven

2014-08-15 Thread visakh
You are running a continuous compilation. AFAIK, it runs in an infinite loop
and recompiles only the modified files. For building with Maven, have a look
at these steps:
https://spark.apache.org/docs/latest/building-with-maven.html
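
For example, a plain build per those docs looks like the following (add the
-Pyarn / -Dhadoop.version=... flags as needed for your target cluster; the
exact flags are documented on that page):

  mvn -DskipTests clean package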

Thanks,
Visakh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-won-t-build-with-maven-tp12173p12176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-15 Thread Carlos J. Gil Bellosta
Thanks for your reply.

I think the problem was that SparkR tried to serialize the whole environment,
and the large dataframe was part of it. So every worker received its slice /
partition (which is very small) plus the whole thing!

So I deleted the large dataframe and list before parallelizing, and the
cluster ran without memory issues.

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com

2014-08-15 3:53 GMT+02:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu:
 Could you try increasing the number of slices with the large data set ?
 SparkR assumes that each slice (or partition in Spark terminology) can fit
 in memory of a single machine.  Also is the error happening when you do the
 map function or does it happen when you combine the results ?

 Thanks
 Shivaram


 On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta
 gilbello...@gmail.com wrote:

 Hello,

 I am having problems trying to apply the split-apply-combine strategy
 for dataframes using SparkR.

 I have a largish dataframe and I would like to achieve something similar
 to what

 ddply(df, .(id), foo)

 would do, only that using SparkR as computing engine. My df has a few
 million records and I would like to split it by id and operate on
 the pieces. These pieces are quite small in size: just a few hundred
 records.

 I do something along the following lines:

 1) Use split to transform df into a list of dfs.
 2) parallelize the resulting list as a RDD (using a few thousand slices)
 3) map my function on the pieces using Spark.
 4) recombine the results (do.call, rbind, etc.)

 My cluster works and I can perform medium sized batch jobs.

 However, it fails with my full df: I get a heap space out of memory
 error. It is funny as the slices are very small in size.

 Should I send smaller batches to my cluster? Is there any recommended
 general approach to these kind of split-apply-combine problems?

 Best,

 Carlos J. Gil Bellosta
 http://www.datanalytics.com

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Issues with S3 client library and Apache Spark

2014-08-15 Thread Darin McBeath
I've seen a couple of issues posted about this, but I never saw a resolution.

When I'm using Spark 1.0.2 (and the spark-submit script to submit my jobs) and
AWS SDK 1.8.7, I get the stack trace below.  However, if I drop back to AWS SDK
1.3.26 (or anything from the AWS SDK 1.4.* family) then everything works fine.
It would appear that after AWS SDK 1.4, the SDK started depending on HttpClient
4.2 (instead of 4.1).  I would like to use a more recent version of the AWS SDK
(and not something nearly 2 years old), so I'm curious whether anyone has
figured out a workaround to this problem.

Thanks.

Darin.

java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:408)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:390)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:374)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:313)
at com.elsevier.s3.SimpleStorageService.<clinit>(SimpleStorageService.java:27)
at com.elsevier.spark.XMLKeyPair.call(SDKeyMapKeyPairRDD.java:75)
at com.elsevier.spark.XMLKeyPair.call(SDKeyMapKeyPairRDD.java:65)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:779)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)

Re: Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-15 Thread Denny Lee
Apologies - we had set the slide download to be available to Seattle Spark
Meetup members only, but we actually meant to share it with everyone. We have
since fixed this and you can now download it. HTH!



On August 14, 2014 at 18:14:35, Denny Lee (denny.g@gmail.com) wrote:

For those who were not able to attend the Seattle Spark Meetup - Spark at eBay
- Troubleshooting the Everyday Issues, the slides have now been posted at:
http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf.

Enjoy!
Denny



Re: Spark webUI - application details page

2014-08-15 Thread Brad Miller
Hi Andrew,

I'm running something close to the present master (I compiled several days
ago) but am having some trouble viewing history.

I set spark.eventLog.enabled to true, but continually receive the error
message (via the web UI) Application history not found...No event logs
found for application ml-pipeline in
file:/tmp/spark-events/ml-pipeline-1408117588599.  I tried 2 fixes:

-I manually set spark.eventLog.dir to a path beginning with file:///,
believing that perhaps the problem was an invalid protocol specification.

-I inspected /tmp/spark-events manually and noticed that each job directory
(and the files therein) was owned by the user who launched the job and
was not world readable.  Since I run Spark from a dedicated Spark user, I
set the files world readable, but I still receive the same Application
history not found error.

Is there a configuration step I may be missing?

-Brad


On Thu, Aug 14, 2014 at 7:33 PM, Andrew Or and...@databricks.com wrote:

 Hi SK,

 Not sure if I understand you correctly, but here is how the user normally
 uses the event logging functionality:

 After setting spark.eventLog.enabled and optionally
 spark.eventLog.dir, the user runs his/her Spark application and calls
 sc.stop() at the end of it. Then he/she goes to the standalone Master UI
 (under http://master-url:8080 by default) and clicks on the application
 under the Completed Applications table. This will link to the Spark UI of
 the finished application in its completed state, under a path that looks
 like http://master-url:8080/history/app-Id. It won't be on
 http://localhost:4040 anymore because the port is now freed for new
 applications to bind their SparkUIs to. To access the file that stores the
 raw statistics, go to the file specified in spark.eventLog.dir. This is
 by default /tmp/spark-events, though in Spark 1.0.1 it may be in HDFS
 under the same path.
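
As a concrete sketch, the relevant entries in conf/spark-defaults.conf would
look something like this (the directory is just an example path):

  spark.eventLog.enabled  true
  spark.eventLog.dir      file:///tmp/spark-events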

 I could be misunderstanding what you mean by the stats being buried in the
 console output, because the events are not logged to the console but to a
 file in spark.eventLog.dir. For all of this to work, of course, you have
 to run Spark in standalone mode (i.e. with master set to
 spark://master-url:7077). In other modes, you will need to use the
 history server instead.

 Does this make sense?
 Andrew


 2014-08-14 18:08 GMT-07:00 SK skrishna...@gmail.com:

 More specifically, as indicated by Patrick above, in 1.0+, apps will have
 persistent state so that the UI can be reloaded. Is there a way to enable
 this feature in 1.0.1?

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12157.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I've been using a standalone cluster all this time and it worked fine.
Recently I've been using another Spark cluster that is based on YARN, and I
have no experience with YARN.

The YARN cluster has 10 nodes and a total memory of 480G.

I'm having trouble starting the spark-shell with enough memory.
I'm doing a very simple operation - reading a file 100GB from HDFS and
running a count on it. This fails due to out of memory on the executors.

Can someone point me to the command line parameters that I should use for
spark-shell so that this works?


Thanks
-Soumya


Re: Spark webUI - application details page

2014-08-15 Thread SK
Hi,

Ok, I was specifying --master local. I changed that to --master
spark://localhostname:7077 and am now able to see the completed
applications. It provides summary stats about runtime and memory usage,
which is sufficient for me at this time.

However, it doesn't seem to archive the info in the application detail UI
that lists detailed stats about the completed stages of the application -
which would be useful for identifying bottleneck steps in a large
application. I guess we need to capture the application detail UI screen
before the app run completes, or find a way to extract this info by parsing
the JSON log file in /tmp/spark-events.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running Spark shell on YARN

2014-08-15 Thread Andrew Or
Hi Soumya,

The driver's console output prints out how much memory is actually granted
to each executor, so from there you can verify how much memory the
executors are actually getting. You should use the '--executor-memory'
argument in spark-shell. For instance, assuming each node has 48G of memory,

bin/spark-shell --executor-memory 46g --master yarn

We leave a small cushion for the OS so we don't take up all of the entire
system's memory. This option also applies to the standalone mode you've
been using, but if you have been using the ec2 scripts, we set
spark.executor.memory in conf/spark-defaults.conf for you automatically
so you don't have to specify it each time on the command line. Of course,
you can also do the same in YARN.

-Andrew



2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com:

 I've been using the standalone cluster all this time and it worked fine.
 Recently I'm using another Spark cluster that is based on YARN and I've
 not experience with YARN.

 The YARN cluster has 10 nodes and a total memory of 480G.

 I'm having trouble starting the spark-shell with enough memory.
 I'm doing a very simple operation - reading a file 100GB from HDFS and
 running a count on it. This fails due to out of memory on the executors.

 Can someone point to the command line parameters that I should use for
 spark-shell so that it?


 Thanks
 -Soumya




Re: spark on yarn cluster can't launch

2014-08-15 Thread Andrew Or
Hi 齐忠,

Thanks for reporting this. You're correct that the default deploy mode is
client. However, this looks like a bug in the YARN integration code; we
should not be throwing a NullPointerException in any case. What version of
Spark are you using?

Andrew


2014-08-15 0:23 GMT-07:00 centerqi hu cente...@gmail.com:

 The code does not run as follows

 ../bin/spark-submit --class org.apache.spark.examples.SparkPi \

 --master yarn \

 --deploy-mode cluster \

 --verbose \

 --num-executors 3 \

 --driver-memory 4g \

 --executor-memory 2g \

 --executor-cores 1 \

 ../lib/spark-examples*.jar \

 100

 Exception in thread main java.lang.NullPointerException

 at
 org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:109)

 at
 org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:108)

 at org.apache.spark.Logging$class.logInfo(Logging.scala:58)


 However, when I removed --deploy-mode cluster \

 Exception disappear.

 I think with the deploy-mode cluster is running in yarn cluster mode, if
 not, the default will be run in yarn client mode.

 But why did yarn cluster get Exception?


 Thanks





 --
 cente...@gmail.com|齐忠



[Spark Streaming] How can we use consecutive data points as the features?

2014-08-15 Thread Yan Fang
Hi guys,

We have a use case where we need to use consecutive data points to predict
the status. (yes, like using time series data to predict the machine
failure). Is there a straight-forward way to do this in Spark Streaming?

If all consecutive data points are in one batch, it's not complicated, except
that the order of data points is not guaranteed within the batch, so I have to
use the timestamp in each data point to reach my goal. However, when the
consecutive data points are spread across two or more batches, how can I do
this? From my understanding, I need to use state management, but it's not easy
with updateStateByKey: e.g., I would need to add one data point and delete the
oldest data point, and I cannot do that in a batch fashion.
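
A minimal sketch of one way to keep the last N points per key with
updateStateByKey (the key, timestamp and value fields are hypothetical, and N
is an assumption; this only illustrates the shape of the update function):

  // state: the most recent n (timestamp, value) pairs for each key, kept sorted by timestamp
  val n = 10
  def updateWindow(newPoints: Seq[(Long, Double)],
                   state: Option[Vector[(Long, Double)]]): Option[Vector[(Long, Double)]] = {
    val merged = (state.getOrElse(Vector.empty) ++ newPoints).sortBy(_._1).takeRight(n)
    Some(merged)
  }

  // keyed: DStream[(String, (Long, Double))] of (machineId, (timestamp, reading))
  val recentWindows = keyed.updateStateByKey(updateWindow _)

Note that updateStateByKey requires a checkpoint directory to be configured on
the StreamingContext.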

Does anyone have similar use case in the community and how do you solve
this? Thank you.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Cui

You can take a look at multinomial logistic regression PR I created.

https://github.com/apache/spark/pull/1379

Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Aug 15, 2014 at 2:24 AM, Cui xp lifeiniao...@gmail.com wrote:
 Did I describe the problem not clearly? Is anyone familiar to softmax
 regression?
 Thanks.
 Cui xp.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I just checked the YARN config and it looks like I need to change this value.
Should it be upgraded to 48G (the max memory allocated to YARN) per node?

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
  <source>java.io.BufferedInputStream@2e7e1ee</source>
</property>


On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 Andrew,

 Thanks for your response.

 When I try to do the following.

  ./spark-shell --executor-memory 46g --master yarn

 I get the following error.

 Exception in thread "main" java.lang.Exception: When running with master
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
 environment.

 at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)

 at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)

 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)

 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 After this I set the following env variable.

 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

 The program launches but then halts with the following error.


 *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB),
 is above the max threshold (6144 MB) of this cluster.*

 I guess this is some YARN setting that is not set correctly.


 Thanks

 -Soumya


 On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote:

 Hi Soumya,

 The driver's console output prints out how much memory is actually
 granted to each executor, so from there you can verify how much memory the
 executors are actually getting. You should use the '--executor-memory'
 argument in spark-shell. For instance, assuming each node has 48G of memory,

 bin/spark-shell --executor-memory 46g --master yarn

 We leave a small cushion for the OS so we don't take up all of the entire
 system's memory. This option also applies to the standalone mode you've
 been using, but if you have been using the ec2 scripts, we set
 spark.executor.memory in conf/spark-defaults.conf for you automatically
 so you don't have to specify it each time on the command line. Of course,
 you can also do the same in YARN.

 -Andrew



 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com:

 I've been using the standalone cluster all this time and it worked fine.
 Recently I'm using another Spark cluster that is based on YARN and I've
 not experience with YARN.

 The YARN cluster has 10 nodes and a total memory of 480G.

 I'm having trouble starting the spark-shell with enough memory.
 I'm doing a very simple operation - reading a file 100GB from HDFS and
 running a count on it. This fails due to out of memory on the executors.

 Can someone point to the command line parameters that I should use for
 spark-shell so that it?


 Thanks
 -Soumya






Re: Running Spark shell on YARN

2014-08-15 Thread Sandy Ryza
We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
maximum node capacity.

-Sandy


On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 I just checked the YARN config and looks like I need to change this value.
 Should be upgraded to 48G (the max memory allocated to YARN) per node ?

 <property>
   <name>yarn.scheduler.maximum-allocation-mb</name>
   <value>6144</value>
   <source>java.io.BufferedInputStream@2e7e1ee</source>
 </property>


 On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
 wrote:

 Andrew,

 Thanks for your response.

 When I try to do the following.

  ./spark-shell --executor-memory 46g --master yarn

 I get the following error.

 Exception in thread main java.lang.Exception: When running with master
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
 environment.

 at
 org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)

 at
 org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61)

 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)

  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 After this I set the following env variable.

 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

 The program launches but then halts with the following error.


 *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104
 MB), is above the max threshold (6144 MB) of this cluster.*

 I guess this is some YARN setting that is not set correctly.


 Thanks

 -Soumya


 On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote:

 Hi Soumya,

 The driver's console output prints out how much memory is actually
 granted to each executor, so from there you can verify how much memory the
 executors are actually getting. You should use the '--executor-memory'
 argument in spark-shell. For instance, assuming each node has 48G of memory,

 bin/spark-shell --executor-memory 46g --master yarn

 We leave a small cushion for the OS so we don't take up all of the
 entire system's memory. This option also applies to the standalone mode
 you've been using, but if you have been using the ec2 scripts, we set
 spark.executor.memory in conf/spark-defaults.conf for you automatically
 so you don't have to specify it each time on the command line. Of course,
 you can also do the same in YARN.

 -Andrew



 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com:

 I've been using the standalone cluster all this time and it worked fine.
 Recently I'm using another Spark cluster that is based on YARN and I've
 not experience with YARN.

 The YARN cluster has 10 nodes and a total memory of 480G.

 I'm having trouble starting the spark-shell with enough memory.
 I'm doing a very simple operation - reading a file 100GB from HDFS and
 running a count on it. This fails due to out of memory on the executors.

 Can someone point to the command line parameters that I should use for
 spark-shell so that it?


 Thanks
 -Soumya







Hardware Context on Spark Worker Hosts

2014-08-15 Thread Chris Brown
Is it practical to maintain a hardware context on each of the worker hosts in 
Spark?  In my particular problem I have an OpenCL (or JavaCL) context which has 
two things associated with it:
  - Data stored on a GPU
  - Code compiled for the GPU
If the context goes away, the data is lost and the code must be recompiled.

The calling code is quite basic and is intended to be used in both batch and
streaming modes.  Here is the batch version:

object Classify {
  def run(sparkContext: SparkContext, config: com.infoblox.Config) {
    val subjects = Subject.load(sparkContext, config)
    val classifications = subjects.mapPartitions(subjectIter =>
      classify(config.gpu, subjectIter)).reduceByKey(_ + _)
    classifications.saveAsTextFile(config.output)
  }

  private def classify(gpu: Option[String], subjects: Iterator[Subject]): Iterator[(String, Long)] = {
    val javaCLContext = JavaCLContext.build(gpu)      // <--
    val classifier = Classifier.build(javaCLContext)  // <--
    subjects.foreach(subject => classifier.classifyInBatches(subject))
    classifier.classifyRemaining
    val results = classifier.results
    classifier.release
    results.result.iterator
  }
}

The two lines marked with <-- are where the JavaCL/OpenCL context is currently
created and used, and that is wrong: the JavaCL context is specific to the
host, not to the map. How do I keep this context between maps, and over a
longer duration for a streaming job?
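
One pattern that is sometimes used for this (a sketch only, reusing the
JavaCLContext and Classifier names from the code above; it assumes that one
context per executor JVM is acceptable) is to hold the context in a lazily
initialized singleton on the worker side, so every task on the same executor
reuses it:

  object GpuContextHolder {
    private var ctx: JavaCLContext = null

    // Built at most once per executor JVM; subsequent tasks reuse the same context.
    def get(gpu: Option[String]): JavaCLContext = synchronized {
      if (ctx == null) ctx = JavaCLContext.build(gpu)
      ctx
    }
  }

  // inside classify(), instead of building a fresh context per partition:
  //   val classifier = Classifier.build(GpuContextHolder.get(gpu))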

Thanks,
Chris...

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark webUI - application details page

2014-08-15 Thread Andrew Or
@Brad

Your configuration looks alright to me. We parse both file:/ and
file:/// the same way so that shouldn't matter. I just tried this on the
latest master and verified that it works for me. Can you dig into the
directory /tmp/spark-events/ml-pipeline-1408117588599 to make sure that
it's not empty? In particular, look for a file that looks like
EVENT_LOG_0, then check the content of that file. The last event (on the
last line) of the file should be an Application Complete event. If this
is not true, it's likely that your application did not call sc.stop(),
though the logs should still show up in spite of that. If all of that
fails, try logging it in a more accessible place through setting
spark.eventLog.dir. Let me know if that helps.

@SK

You shouldn't need to capture the screen before it finishes; the whole
point of the event logging functionality is that the user doesn't have to
do that themselves. What happens if you click into the application detail
UI? In Spark 1.0.1, if it can't find the logs it may just refresh instead
of printing a more explicit message. However, from your configuration you
should be able to see the detailed stage information in the UI in addition
to just the summary statistics under Completed Applications. I have
listed a few debugging steps in the paragraph above, so maybe they're also
applicable to you.

Let me know if that works,
Andrew


2014-08-15 11:07 GMT-07:00 SK skrishna...@gmail.com:

 Hi,

 Ok, I was specifying --master local. I changed that to --master
 spark://localhostname:7077 and am now  able to see the completed
 applications. It provides summary stats about runtime and memory usage,
 which is sufficient for me at this time.

 However it doesn't seem to archive the info in the application detail UI
 that lists detailed stats about the completed stages of the application -
 which would be useful for identifying bottleneck steps in a large
 application. I guess we need to capture the application detail UI screen
 before the app run completes or find a way to extract this info by  parsing
 the Json log file in /tmp/spark-events.

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12187.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




closure issue - works in scalatest but not in spark-shell

2014-08-15 Thread Mohit Jaggi
Folks,

I wrote the following wrapper on top of combineByKey. The RDD is of
Array[Any] and I am extracting a field at a given index for combining.
There are two ways in which I tried this:

Option A: leave colIndex abstract in the Aggregator class and define it in the
derived object Aggtor with value -1. It is set later in the function
myAggregate. This works fine, but I want to keep the API user unaware of
colIndex.

Option B (shown in the code below): set colIndex to -1 in the abstract class.
Aggtor does not mention it at all. It is set later in myAggregate.

Option B works from scalatest in Eclipse but runs into a closure mishap in the
spark-shell. I am looking for an explanation and a possible
solution/workaround. Appreciate any help!

Thanks,

Mohit.

-- API helper -

abstract class Aggregator[U] {

  var colIndex: Int = -1

  def convert(a: Array[Any]): U = {
    a(colIndex).asInstanceOf[U]
  }

  def mergeValue(a: U, b: Array[Any]): U = {
    aggregate(a, convert(b))
  }

  def mergeCombiners(x: U, y: U): U = {
    aggregate(x, y)
  }

  def aggregate(p: U, q: U): U
}

-- API handler -

def myAggregate[U: ClassTag](...aggtor: Aggregator[U]) = {
  aggtor.colIndex = something
  keyBy(aggByCol).combineByKey(aggtor.convert, aggtor.mergeValue, aggtor.mergeCombiners)
}

 call the API 

case object Aggtor extends Aggregator[List[String]] {
  //var colIndex = -1
  def aggregate = 
}

myAggregate(...Aggtor)


Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Debasish Das
DB,

Did you compare softmax regression with one-vs-all and find that softmax
is better?

one-vs-all can be implemented as a wrapper over the binary classifiers that we
have in MLlib... I am curious whether softmax multinomial is better in most
cases, or whether it is worthwhile to add a one-vs-all version of MLOR as well.

Thanks.
Deb


On Fri, Aug 15, 2014 at 11:39 AM, DB Tsai dbt...@dbtsai.com wrote:

 Hi Cui

 You can take a look at multinomial logistic regression PR I created.

 https://github.com/apache/spark/pull/1379

 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Aug 15, 2014 at 2:24 AM, Cui xp lifeiniao...@gmail.com wrote:
  Did I describe the problem not clearly? Is anyone familiar to softmax
  regression?
  Thanks.
  Cui xp.
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Debasish,

I didn't try one-vs-all vs. softmax regression. One issue is that for
one-vs-all, we have to train k classifiers for a k-class problem, so the
training time will be k times longer.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Aug 15, 2014 at 11:53 AM, Debasish Das debasish.da...@gmail.com
wrote:

 DB,

 Did you compare softmax regression with one-vs-all and found that softmax
 is better ?

 one-vs-all can be implemented as a wrapper over binary classifier that we
 have in mllib...I am curious if softmax multinomial is better on most cases
 or is it worthwhile to add a one vs all version of mlor as well ?

 Thanks.
 Deb


 On Fri, Aug 15, 2014 at 11:39 AM, DB Tsai dbt...@dbtsai.com wrote:

 Hi Cui

 You can take a look at multinomial logistic regression PR I created.

 https://github.com/apache/spark/pull/1379

 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Aug 15, 2014 at 2:24 AM, Cui xp lifeiniao...@gmail.com wrote:
  Did I describe the problem not clearly? Is anyone familiar to softmax
  regression?
  Thanks.
  Cui xp.
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





ALS checkpoint performance

2014-08-15 Thread Debasish Das
Hi,

Are there any experiments detailing the performance hit due to HDFS
checkpoint in ALS ?

As we scale to larger ranks with more ratings, I believe we have to cut the
RDD lineage to safeguard against the lineage issue...

Thanks.
Deb


Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
After changing the allocation I'm getting the following in my logs. No idea
what this means.

14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED

[... the same report repeats once per second through 14/08/15 15:44:51, with
yarnAppState remaining ACCEPTED ...]


On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 We generally recommend setting yarn.scheduler.maximum-allocation-mbto the
 maximum node capacity.

 -Sandy


 On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com
  wrote:

 I just checked the YARN config and looks like I need to change this
 value. Should be upgraded to 48G (the max memory allocated to YARN) per
 node ?

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>
    <source>java.io.BufferedInputStream@2e7e1ee</source>
  </property>


 On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
  wrote:

 Andrew,

 Thanks for your response.

 When I try to do the following.

  ./spark-shell --executor-memory 46g --master yarn

 I get the following error.

 Exception in thread main java.lang.Exception: When running with master
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
 environment.

 at
 org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)

 at
 org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61)

 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)

  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 After this I set the following env variable.

 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

 The program launches but then halts with the following error.


 *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104
 MB), is above the max threshold (6144 MB) of this cluster.*

 I guess this is some YARN setting that is not set correctly.


 Thanks

 -Soumya


 On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or 

Re: Running Spark shell on YARN

2014-08-15 Thread Kevin Markey

  
  
Sandy and others:

Is there a single source of Yarn/Hadoop properties that should be
set or reset for running Spark on Yarn? We've sort of stumbled through
one property after another, and (unless there's an update I've not yet
seen) CDH5 Spark-related properties are for running the Spark Master
instead of Yarn.

Thanks
Kevin

On 08/15/2014 12:47 PM, Sandy Ryza wrote:

 We generally recommend setting yarn.scheduler.maximum-allocation-mb to
 the maximum node capacity.

 -Sandy

 On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com
 wrote:

  I just checked the YARN config and it looks like I need to change this
  value. Should it be upgraded to 48G (the max memory allocated to YARN)
  per node?

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>
    <source>java.io.BufferedInputStream@2e7e1ee</source>
  </property>

  On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
  wrote:

   Andrew,

   Thanks for your response.

   When I try to do the following.

    ./spark-shell --executor-memory 46g --master yarn

   I get the following error.

   Exception in thread "main" java.lang.Exception: When running with master
   'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
   environment.

   at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)

   at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)

   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)

   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

   After this I set the following env variable.

   export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

   The program launches but then halts with the following error.

   14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB),
   is above the max threshold (6144 MB) of this cluster.

   I guess this is some YARN setting that is not set correctly.

   Thanks

   -Soumya

   On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote:

    Hi Soumya,

    The driver's console output prints out how much memory is actually
    granted to each executor, so from there you can verify how much memory
    the executors are actually getting. You should use the
    '--executor-memory' argument in spark-shell. For instance, assuming each
    node has 48G of memory,

    bin/spark-shell --executor-memory 46g --master yarn

    We leave a small cushion for the

spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread salemi
Hi All,

I am just trying to save the Kafka DStream to Hadoop as follows:

  val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
  dStream.saveAsHadoopFiles(hdfsDataUrl, "data")

It throws the following exception. What am I doing wrong? 

14/08/15 14:30:09 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf
java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
at
org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:168)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:185)
at
org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:259)
at
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
^C14/08/15 14:30:10 ERROR OneForOneStrategy:
org.apache.hadoop.mapred.JobConf
java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at

Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-15 Thread Brandon Amos
Hi Spark community,

At Adobe Research, we're happy to open source a prototype
technology called Spindle that we've been developing over
the past few months for processing analytics queries with Spark.
Please take a look at the repository on GitHub at
https://github.com/adobe-research/spindle,
and we welcome any feedback. Thanks!

Regards,
Brandon.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread Sean Owen
Somewhere, your function has a reference to the Hadoop JobConf object
and is trying to send that to the workers. It's not in this code you
pasted so must be from something slightly different?

It shouldn't need to send that around and in fact it can't be
serialized as you see. If you need a Hadoop Configuration object, you
can get that from SparkContext, which you can get from the
StreamingContext.
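
For example (ssc being the StreamingContext), something along these lines:

  val hadoopConf: org.apache.hadoop.conf.Configuration = ssc.sparkContext.hadoopConfiguration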

On Fri, Aug 15, 2014 at 9:37 PM, salemi alireza.sal...@udo.edu wrote:
 Hi All,

 I am just trying to save the kafka dstream to hadoop as followed

   val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
   dStream.saveAsHadoopFiles(hdfsDataUrl, data)

 It throws the following exception. What am I doing wrong?

 14/08/15 14:30:09 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf
 java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
 at
 org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:168)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at
 org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:185)
 at
 org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:259)
 at
 org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167)
 at
 org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 ^C14/08/15 14:30:10 ERROR OneForOneStrategy:
 org.apache.hadoop.mapred.JobConf
 java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 

Re: ALS checkpoint performance

2014-08-15 Thread Xiangrui Meng
Guoqiang reported some results in his PRs
https://github.com/apache/spark/pull/828 and
https://github.com/apache/spark/pull/929 . But this is really
problem-dependent. -Xiangrui
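
For reference, the generic lineage-cutting mechanism looks like this (a sketch
only, not ALS-specific; the checkpoint path is hypothetical):

  sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")
  // marks the RDD for checkpointing; it is written to the checkpoint dir
  // (and its lineage truncated) the next time it is computed
  someRdd.checkpoint()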

On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote:
 Hi,

 Are there any experiments detailing the performance hit due to HDFS
 checkpoint in ALS ?

 As we scale to large ranks with more ratings, I believe we have to cut the
 RDD lineage to safe guard against the lineage issue...

 Thanks.
 Deb


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread salemi
Look, this is the whole program. I am not trying to serialize the JobConf.
 
def main(args: Array[String]) {
try {
  val properties = getProperties(settings.properties)
  StreamingExamples.setStreamingLogLevels()
  val zkQuorum =  properties.get(zookeeper.list).toString()
  val topic = properties.get(topic.name).toString()
  val group = properties.get(group.name).toString()
  val threads = properties.get(consumer.threads).toString()
  val topicpMap = Map(topic - threads.toInt)
  val hdfsNameNodeUrl = properties.get(hdfs.namenode.url).toString()
  val hdfsCheckPointUrl = hdfsNameNodeUrl +
properties.get(hdfs.checkpoint.path).toString()
  val hdfsDataUrl = hdfsNameNodeUrl +
properties.get(hdfs.data.path).toString()
  val checkPointInterval =
properties.get(spark.streaming.checkpoint.interval).toString().toInt
  val sparkConf = new SparkConf().setAppName(KafkaMessageReceiver)
  println(===)
  println(kafka configuration: zk: + zkQuorum + ; topic: + topic +
; group: + group +  ; threads: + threads)
  println(===)
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  ssc.checkpoint(hdfsCheckPointUrl)
  val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
  dStream.checkpoint(Seconds(checkPointInterval))
  dStream.saveAsNewAPIHadoopFiles(hdfsDataUrl, csv, classOf[String],
classOf[String], classOf[TextOutputFormat[String,String]],
ssc.sparkContext.hadoopConfiguration)
  
  val eventData = dStream.map(_._2).map(_.split(,)).map(data =
DataObject(data(0), data(1), data(2), data(3), data(4), data(5), data(6),
data(7), data(8).toLong, data(9), data(10), data(11), data(12).toLong,
data(13), data(14)))
  val count = eventData.filter(_.state ==
COMPLETE).countByWindow(Minutes(15), Seconds(1))
  count.map(cnt = the Total count of calls in complete state  in the
last 15 minutes is:  + cnt).print()
  ssc.start()
  ssc.awaitTermination()

} catch {
  case e: Exception = println(exception caught:  + e);
}
  }
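
If it helps narrow things down, here is a hedged sketch of an alternative write
path: writing each batch with foreachRDD and saveAsTextFile, so the DStream
operation only captures the output path string rather than a Hadoop
configuration object. Whether this actually avoids the JobConf serialization
during checkpointing would still need to be verified.

      // Sketch only: write each micro-batch as text instead of via saveAsNewAPIHadoopFiles.
      // The closure captures just hdfsDataUrl (a String), not a Configuration/JobConf.
      dStream.map { case (key, value) => key + "," + value }
        .foreachRDD { (rdd, time) =>
          // count() forces an extra pass; it only exists here to skip empty batches
          if (rdd.count() > 0) {
            rdd.saveAsTextFile(hdfsDataUrl + "-" + time.milliseconds)
          }
        }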



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-saving-kafka-DStream-into-hadoop-throws-exception-tp12202p12207.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



mlib model viewing and saving

2014-08-15 Thread Sameer Tilak
Hi All,
I have an MLlib model:

val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)

The model has the following methods: algo, asInstanceOf, isInstanceOf, predict,
toString, topNode.

model.topNode outputs:

org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf = false, predict = 0.5,
split = Some(Feature = 87, threshold = 0.7931471805599453, featureType = Continuous,
categories = List()), stats = Some(gain = 0.89, impurity = 0.35,
left impurity = 0.12, right impurity = 0.00, predict = 0.50)

I was wondering what the best way is to look at the model. We want to see what
the decision tree looks like -- which features are selected, the details of the
splits, the depth, and so on. Is there an easy way to see that? I can traverse it
recursively using topNode.leftNode and topNode.rightNode, but I was also wondering
whether there is a way to inspect the whole model and save it on HDFS for later use.
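
One way to inspect it is a small recursive printer along these lines. This is a
sketch only; the field names are read off the Node output above (split, threshold,
leftNode/rightNode as Options), so please double-check them against your MLlib
version:

    import org.apache.spark.mllib.tree.model.Node

    // Sketch: print each node, indenting by depth; leaves show the prediction,
    // internal nodes show the split feature and threshold.
    def printTree(node: Node, indent: String = ""): Unit = {
      if (node.isLeaf) {
        println(indent + "predict = " + node.predict)
      } else {
        println(indent + "split on feature " + node.split.get.feature +
          " at threshold " + node.split.get.threshold)
        node.leftNode.foreach(printTree(_, indent + "  "))
        node.rightNode.foreach(printTree(_, indent + "  "))
      }
    }

    printTree(model.topNode)

For saving, I am not aware of a built-in model export/import in this version; one
workaround (not an official API) is plain Java serialization of the model object
to a file on HDFS, assuming the model class is serializable.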
  

Does HiveContext support Parquet?

2014-08-15 Thread lyc
Since SQLContext supports less SQL than Hive (if I understand correctly), I
plan to run more queries through hql. However, is it possible to create some
tables as Parquet in hql? What kind of commands should I use? Thanks in
advance for any information.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala Spark Distinct on a case class doesn't work

2014-08-15 Thread clarkroberts
I just discovered that the Distinct call is working as expected when I run a
driver through spark-submit. This is only an issue in the REPL environment.
Very strange...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-Spark-Distinct-on-a-case-class-doesn-t-work-tp12206p12210.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Updating exising JSON files

2014-08-15 Thread ejb11235
I have a bunch of JSON files stored in HDFS that I want to read in, modify,
and write back out. I'm new to all this and am not sure if this is even the
right thing to do.

Basically, my JSON files contain my raw data, and I want to calculate some
derived data and add it to the existing data.

So first, is my basic approach to the problem flawed? Should I be placing
derived data somewhere else?

If not, how do I modify the existing JSON files?

Note: I have been able to read the JSON files into an RDD using
sqlContext.jsonFile, and save them back using RDD.saveAsTextFile(). But this
creates new files. Is there a way to overwrite the original files?

Thanks!
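
One common pattern is to treat the files as immutable: read, enrich, write a new
directory, and point downstream jobs at the new path. A minimal sketch, assuming
each line is one JSON record; enrichJson and the paths are hypothetical
placeholders, not real APIs:

    // Sketch: HDFS files are effectively immutable, so "updating" usually means
    // reading, enriching, and writing to a NEW directory.
    // enrichJson is a hypothetical helper that parses one JSON line, adds the
    // derived fields, and re-serializes it to a JSON string.
    val raw = sc.textFile("hdfs:///data/events/current")     // assumed input path
    val enriched = raw.map(line => enrichJson(line))
    enriched.saveAsTextFile("hdfs:///data/events/enriched")  // assumed output path

    // If the old files must go away, delete or rename the old directory afterwards
    // with the HDFS FileSystem API or `hdfs dfs -rm -r` / `hdfs dfs -mv`, outside Spark.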



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-tp12211.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does HiveContext support Parquet?

2014-08-15 Thread Silvio Fiorito
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was 
include the parquet-hive-bundle-1.5.0.jar on my classpath.

From: lyc <mailto:yanchen@huawei.com>
Sent: Friday, August 15, 2014 7:30 PM
To: u...@spark.incubator.apache.org <mailto:u...@spark.incubator.apache.org>

Since SQLContext supports less SQL than Hive (if I understand correctly), I
plan to run more queries through hql. However, is it possible to create some
tables as Parquet in hql? What kind of commands should I use? Thanks in
advance for any information.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread salemi
If I reduce the app to the following code, then I don't see the exception. It
creates the Hadoop files, but they are empty! The DStream doesn't get written
out to the files!

  def main(args: Array[String]) {
    try {
      val properties = getProperties("settings.properties")
      StreamingExamples.setStreamingLogLevels()
      val zkQuorum = properties.get("zookeeper.list").toString()
      val topic = properties.get("topic.name").toString()
      val group = properties.get("group.name").toString()
      val threads = properties.get("consumer.threads").toString()
      val topicpMap = Map(topic -> threads.toInt)
      val hdfsNameNodeUrl = properties.get("hdfs.namenode.url").toString()
      val hdfsCheckPointUrl = hdfsNameNodeUrl + properties.get("hdfs.checkpoint.path").toString()
      val hdfsDataUrl = hdfsNameNodeUrl + properties.get("hdfs.data.path").toString()
      val checkPointInterval = properties.get("spark.streaming.checkpoint.interval").toString().toInt
      val sparkConf = new SparkConf().setAppName("KafkaMessageReceiver")
      println("===")
      println("kafka configuration: zk: " + zkQuorum + "; topic: " + topic +
        "; group: " + group + "; threads: " + threads)
      println("===")
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
      dStream.saveAsNewAPIHadoopFiles(hdfsDataUrl, "csv", classOf[String],
        classOf[String], classOf[TextOutputFormat[String, String]],
        ssc.sparkContext.hadoopConfiguration)

      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception => println("exception caught: " + e)
    }
  }



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-saving-kafka-DStream-into-hadoop-throws-exception-tp12202p12213.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-15 Thread abhiguruvayya
My use case is as follows.

1. Read the input data from the local file system using
sparkContext.textFile(<input path>).
2. Partition the input data (80 million records) using
RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer
function. Without coalesce() or repartition() on the input data, Spark runs
really slowly and fails with an out-of-memory exception.

The issue I am facing is deciding the number of partitions to apply to the
input data. *The input data size varies every time, so hard-coding a particular
value is not an option. Spark performs well only when a near-optimal number of
partitions is used, which I currently have to find through lots of iteration
(trial and error). That is not an option in a production environment.*

My question: is there a rule of thumb for deciding the number of partitions
based on the input data size and the cluster resources available (executors,
cores, etc.)? If yes, please point me in that direction. Any help is much
appreciated.

I am using spark 1.0 on yarn.

Thanks,
AG
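
There is no single official formula as far as I know, but one rough heuristic is
to take the larger of (input bytes / target partition size) and (total cores x a
small multiplier). A sketch, where every constant is a placeholder assumption to
tune and the input is assumed to be on a Hadoop-visible filesystem:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: derive a partition count from input size and cluster size.
    // All of the values below are tuning assumptions, not Spark settings.
    val inputPath = "/data/input"                  // assumed input location
    val numExecutors = 3                           // whatever you request from YARN
    val coresPerExecutor = 2
    val targetPartitionBytes = 128L * 1024 * 1024  // aim for roughly block-sized partitions
    val tasksPerCore = 3                           // 2-4 tasks per core is a common rule of thumb

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength

    val bySize = math.max(1L, inputBytes / targetPartitionBytes).toInt
    val byCores = numExecutors * coresPerExecutor * tasksPerCore
    val numPartitions = math.max(bySize, byCores)

    val data = sc.textFile(inputPath).repartition(numPartitions)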



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX Pagerank application

2014-08-15 Thread Ankur Dave
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers 
alexander.rigg...@gmail.com wrote:

 To perform the page rank I have to create a graph object, adding the edges
 by setting sourceID=id and distID=brand. In GraphLab there is function: g =
 SGraph().add_edges(data, src_field='id', dst_field='brand')

 Is there something similar in GraphX?


It sounds like you're trying to parse an edge list file into a graph, where
each line is a comma-separated pair of numeric vertex ids. There's a
built-in parser for tab-separated pairs (see GraphLoader) and it should be
easy to adapt that to comma-separated pairs. You can also drop the header
line using RDD#filter (and eventually using
https://github.com/apache/spark/pull/1839).

Ankur http://www.ankurdave.com/
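
For a concrete starting point, here is a rough sketch of that approach; the file
path, the header handling, and the assumption that both columns parse as Long
vertex ids are placeholders rather than details from the original post:

    import org.apache.spark.graphx._

    // Sketch: build a graph from a CSV edge list with a header row, assuming each
    // remaining line is "srcId,dstId" with both ids numeric, then run PageRank.
    val lines = sc.textFile("hdfs:///data/edges.csv")  // assumed location
    val header = lines.first()
    val edgeTuples = lines
      .filter(line => line != header && line.trim.nonEmpty)
      .map { line =>
        val Array(src, dst) = line.split(",")
        (src.trim.toLong, dst.trim.toLong)
      }

    val graph = Graph.fromEdgeTuples(edgeTuples, 1)
    val ranks = graph.pageRank(0.0001).vertices
    // Show the ten highest-ranked vertices.
    ranks.map(_.swap).top(10).foreach(println)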


Re: Does HiveContext support Parquet?

2014-08-15 Thread lyc
Thank you for your reply. 

Do you know where I can find some detailed information about how to use
Parquet in HiveContext? 

Any information is appreciated.
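
In case it helps while you look for documentation, here is a hedged sketch from
HiveContext. The table and column names are made up, and the SerDe/input/output
format class names in the commented variant are assumptions based on the
parquet-hive bundle, so please verify them against the jar you use. Note that the
short STORED AS PARQUET syntax needs Hive 0.13 or later, while the Hive bundled
with Spark 1.0.x is older.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Short form -- requires a Hive version that understands STORED AS PARQUET:
    hiveContext.hql(
      "CREATE TABLE events_parquet (id BIGINT, name STRING) STORED AS PARQUET")

    // Long form for older Hive versions, using the parquet-hive SerDe
    // (class names assumed from the parquet-hive bundle; verify against your jar):
    //   CREATE TABLE events_parquet (id BIGINT, name STRING)
    //   ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    //   STORED AS
    //     INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    //     OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'

    // Populate it from an existing Hive table (some_source_table is a placeholder):
    hiveContext.hql("INSERT INTO TABLE events_parquet SELECT id, name FROM some_source_table")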



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error in sbt/sbt package

2014-08-15 Thread Deep Pradhan
I am getting the following error while doing SPARK_HADOOP_VERSION=2.3.0
sbt/sbt/package

java.io.IOException: Cannot run program
/home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No
such file or directory

 ...lots of errors

 [error] (core/compile:compile) java.io.IOException: Cannot run program
/home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No
such file or directory

[error] Total time: 198 s, completed 16 Aug, 2014 10:25:50 AM

My ~/.bashrc file has the following (apart from other paths too)

export JAVA_HOME=usr/lib/jvm/java-7-oracle

export PATH=$PATH:$JAVA_HOME/bin


 My spark-env.sh file has the following (apart from other paths)

export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle

Can anyone tell me what I should modify?

Thank You