Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Akhil Das
You can also mount HDFS through the NFS gateway and access it that way, I think.

Thanks
Best Regards

On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:

 http://hortonworks.com/blog/windows-explorer-experience-hdfs/

 Seemed to exist, now no sign of it.

 Anything similar to tie HDFS into Windows Explorer?

 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: History server is not receiving any event

2015-08-29 Thread Akhil Das
Are you starting your history server?

./sbin/start-history-server.sh

You can read more here
http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
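
For reference, a minimal end-to-end sketch of the settings involved (the host,
port and paths below are illustrative, not taken from your setup):

  # conf/spark-defaults.conf on the nodes that submit applications
  spark.eventLog.enabled            true
  spark.eventLog.dir                hdfs://master-host:8020/spark-logs

  # conf/spark-env.sh on the node that runs the history server
  export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://master-host:8020/spark-logs"

  # create the log directory once, then start the server
  hdfs dfs -mkdir -p /spark-logs
  ./sbin/start-history-server.sh

Both spark.eventLog.dir and spark.history.fs.logDirectory have to point at the
same existing directory, and an application's events typically only show up
after it has finished (sc.stop()).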



Thanks
Best Regards

On Tue, Aug 25, 2015 at 1:07 AM, b.bhavesh b.borisan...@gmail.com wrote:

 Hi,

 I am working on a streaming application.
 I tried to configure the history server to persist the application's events in
 the Hadoop file system (HDFS). However, it is not logging any events.
 I am running Apache Spark 1.4.1 (PySpark) under Ubuntu 14.04 with three
 nodes.
 Here is my configuration:
 File: /usr/local/spark/conf/spark-defaults.conf   # on all three nodes
 spark.eventLog.enabled true
 spark.eventLog.dir hdfs://master-host:port/usr/local/hadoop/spark_log

 # on the master node
 export SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://host:port/usr/local/hadoop/spark_log

 Can someone give a list of steps to configure the history server?

 Thanks and regards,
 b.bhavesh





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/History-server-is-not-receiving-any-event-tp24426.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Ted Yu
See
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

FYI

On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

  You can also mount HDFS through the NFS gateway and access it that way, I think.

 Thanks
 Best Regards

 On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:

 http://hortonworks.com/blog/windows-explorer-experience-hdfs/

  Seemed to exist, now no sign of it.
 
  Anything similar to tie HDFS into Windows Explorer?

 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Array Out OF Bound Exception

2015-08-29 Thread Akhil Das
I suspect that in the last scenario you have an empty new line as the last
line. If you put a try..catch around the parsing you'd know for sure.
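
For example, a minimal sketch of guarding the parse instead of letting the
index access throw (the file path and field positions are illustrative):

  val lines = sc.textFile("/path/to/scenario3.txt")

  val rows = lines.flatMap { line =>
    val fields = line.split("\\|")
    // short or blank lines (e.g. a trailing newline) are skipped instead of
    // throwing ArrayIndexOutOfBoundsException on fields(2)
    if (fields.length >= 3) Some((fields(0), fields(1), fields(2))) else None
  }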

Thanks
Best Regards

On Tue, Aug 25, 2015 at 2:53 AM, Michael Armbrust mich...@databricks.com
wrote:

 This top line here indicates that the exception is being thrown from
 your code (i.e. code written in the console).

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40)


 Check to make sure that you are properly handling data that has fewer
 columns than you would expect.



 On Mon, Aug 24, 2015 at 12:41 PM, SAHA, DEBOBROTA ds3...@att.com wrote:

 Hi ,



 I am using Spark 1.4 and I am getting an ArrayIndexOutOfBoundsException
 when I am trying to read from a registered table in Spark.



 For example, if I have 3 different text files with the content as below:



 *Scenario 1*:

 A1|B1|C1

 A2|B2|C2



 *Scenario 2*:

 A1| |C1

 A2| |C2



 *Scenario 3*:

 A1| B1|

 A2| B2|



 So for Scenarios 1 and 2 it’s working fine, but for Scenario 3 I am getting
 the following error:



 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 3.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException: 2

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40)

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:38)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to
 (TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)

 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)

 at org.apache.spark.scheduler.Task.run(Task.scala:70)

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)



 Driver stacktrace:

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)

 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)

 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)

 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)



 Please help.



 Thanks,

 Debobrota











Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Roberto Congiu
If HDFS is on a Linux VM, you could also mount it with FUSE and export it
with Samba.

2015-08-29 2:26 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 See
 https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

 FYI

 On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

  You can also mount HDFS through the NFS gateway and access it that way, I think.

 Thanks
 Best Regards

 On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:

 http://hortonworks.com/blog/windows-explorer-experience-hdfs/

  Seemed to exist, now no sign of it.
 
  Anything similar to tie HDFS into Windows Explorer?

 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Roberto Congiu
It depends: if HDFS is running under Windows, FUSE won't work, but if HDFS
is on a Linux VM, box, or cluster, then you can have the Linux box/VM mount
HDFS through FUSE and at the same time export its mount point over Samba. At
that point, your Windows machine can just connect to the Samba share.
R.

2015-08-29 4:04 GMT-07:00 Dino Fancellu d...@felstar.com:

 I'm using Windows.

 Are you saying it works with Windows?

 Dino.

 On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote:
  You can also mount HDFS through the NFS gateway and access it that way, I think.
 
  Thanks
  Best Regards
 
  On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:
 
  http://hortonworks.com/blog/windows-explorer-experience-hdfs/
 
   Seemed to exist, now no sign of it.
  
   Anything similar to tie HDFS into Windows Explorer?
 
  Thanks,
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Dino Fancellu
I'm using Windows.

Are you saying it works with Windows?

Dino.

On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote:
 You can also mount HDFS through the NFS gateway and access it that way, I think.

 Thanks
 Best Regards

 On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:

 http://hortonworks.com/blog/windows-explorer-experience-hdfs/

 Seemed to exist, now no sign of it.

 Anything similar to tie HDFS into Windows Explorer?

 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Apache Spark Suitable JDBC Driver not found

2015-08-29 Thread shawon


I am using Apache Spark for analyzing query logs. I already faced some
difficulties setting up Spark. Now I am using a standalone cluster to process
queries.

First I used the example code in Java to count words, and that worked fine. But
when I try to connect it to a MySQL server, a problem arises. I am using Ubuntu
14.04 LTS, 64 bit, Spark version 1.4.1, MySQL 5.1.

This is my code: when I use the master URL instead of local[*], I get the
error "No suitable driver found". I have included the log.

Here's the Stack Overflow link with the details of the problem:

http://stackoverflow.com/questions/32280276/apache-spark-mysql-connection-suitable-jdbc-driver-not-found?noredirect=1#comment52445586_32280276
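
For reference, a minimal sketch of the usual fix (jar path, hosts and table
names are illustrative): the MySQL connector jar has to be on the classpath of
both the driver and the executors, which is what --jars does, and the "driver"
option forces the driver class to be registered:

  spark-submit --class QueryLogJob \
    --master spark://master-host:7077 \
    --jars /path/to/mysql-connector-java-5.1.36.jar \
    query-log-job.jar

  // inside the job
  import org.apache.spark.sql.SQLContext
  val sqlContext = new SQLContext(sc)
  val logs = sqlContext.read.format("jdbc").options(Map(
    "url"     -> "jdbc:mysql://db-host:3306/logs?user=spark&password=secret",
    "dbtable" -> "query_log",
    "driver"  -> "com.mysql.jdbc.Driver"   // registers the driver class explicitly
  )).load()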



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Suitable-JDBC-Driver-not-found-tp24505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Array Out OF Bound Exception

2015-08-29 Thread Raghavendra Pandey
So either you have an empty line at the end, or when you use String.split you
don't specify -1 as the second parameter...
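
In other words, a minimal sketch of the difference on a Scenario 3 style line:

  val line = "A1| B1|"

  line.split("\\|")        // Array("A1", " B1")      -> index 2 throws
  line.split("\\|", -1)    // Array("A1", " B1", "")  -> the empty third column is kept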
On Aug 29, 2015 1:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

 I suspect that in the last scenario you have an empty new line as the
 last line. If you put a try..catch around the parsing you'd know for sure.

 Thanks
 Best Regards

 On Tue, Aug 25, 2015 at 2:53 AM, Michael Armbrust mich...@databricks.com
 wrote:

 This top line here indicates that the exception is being thrown from
 your code (i.e. code written in the console).

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40)


 Check to make sure that you are properly handling data that has fewer
 columns than you would expect.



 On Mon, Aug 24, 2015 at 12:41 PM, SAHA, DEBOBROTA ds3...@att.com wrote:

 Hi ,



 I am using Spark 1.4 and I am getting an ArrayIndexOutOfBoundsException
 when I am trying to read from a registered table in Spark.



 For example, if I have 3 different text files with the content as below:



 *Scenario 1*:

 A1|B1|C1

 A2|B2|C2



 *Scenario 2*:

 A1| |C1

 A2| |C2



 *Scenario 3*:

 A1| B1|

 A2| B2|



 So for Scenarios 1 and 2 it’s working fine, but for Scenario 3 I am
 getting the following error:



 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 3.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException: 2

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40)

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:38)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to
 (TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)

 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)

 at org.apache.spark.scheduler.Task.run(Task.scala:70)

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)



 Driver stacktrace:

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)

 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)

 at
 

Re: How to set environment of worker applications

2015-08-29 Thread Jan Algermissen
Finally, I found the solution:

on the SparkConf (before creating the SparkContext) you can set
spark.executorEnv.[EnvironmentVariableName], and these will be available in the
environment of the executors.

This is in fact documented, but somehow I missed it.

https://spark.apache.org/docs/latest/configuration.html#runtime-environment
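
For anyone landing here later, a minimal sketch of that setting (the variable
name and value are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("executor-env-example")
    // forwards the submitter's environment variable to every executor
    .set("spark.executorEnv.MY_SERVICE_URL", sys.env.getOrElse("MY_SERVICE_URL", "http://localhost:8080"))

  val sc = new SparkContext(conf)

  // on the executors the value is visible in the process environment
  sc.parallelize(1 to 2).map(_ => sys.env.get("MY_SERVICE_URL")).collect()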

Thanks for all the help though.

Jan


On 25 Aug 2015, at 08:57, Hemant Bhanawat hemant9...@gmail.com wrote:

 Ok, I went in the direction of system vars since the beginning, probably because 
 the question was about passing variables to a particular job. 
 
 Anyway, the decision to use either system vars or environment vars would 
 solely depend on whether you want to make them available to all the spark 
 processes on a node or to a particular job. 
 
 Are there any other reasons why one would prefer one over the other? 
 
 
 On Mon, Aug 24, 2015 at 8:48 PM, Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:
 System properties and environment variables are two different things. One 
 can use spark.executor.extraJavaOptions to pass system properties and 
 spark-env.sh to pass environment variables.
 
 -raghav
 
 On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat hemant9...@gmail.com wrote:
 That's surprising. Passing the environment variables using 
 spark.executor.extraJavaOptions=-Dmyenvvar=xxx to the executor and then 
 fetching them using System.getProperty("myenvvar") has worked for me. 
 
 What is the error that you guys got? 
 
 On Mon, Aug 24, 2015 at 12:10 AM, Sathish Kumaran Vairavelu 
 vsathishkuma...@gmail.com wrote:
 spark-env.sh works for me in Spark 1.4 but not 
 spark.executor.extraJavaOptions.
 
 On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:
 I think the only way to pass environment variables on to a worker node is to 
 write them in the spark-env.sh file on each worker node.
 
 On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com wrote:
 Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions 
 in the following article. I think you can use -D to pass system vars:
 
 spark.apache.org/docs/latest/configuration.html#runtime-environment
 
 Hi,
 
 I am starting a spark streaming job in standalone mode with spark-submit.
 
 Is there a way to make the UNIX environment variables with which spark-submit 
 is started available to the processes started on the worker nodes?
 
 Jan
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Raghavendra Pandey
We are using Cassandra for a similar kind of problem and it works well... You
need to take care of the race condition between updating the store and looking
it up...
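
For the archives, a minimal sketch of the lookup pattern being discussed; the
KeyValueClient trait and openClient() are placeholders for whatever client the
chosen store (Cassandra, HBase, Redis, ...) provides, not a real API:

  case class Entry(key: String, payload: String)

  trait KeyValueClient {
    def get(key: String): Option[String]
    def close(): Unit
  }

  def openClient(): KeyValueClient = ???   // connection setup for the chosen store

  val entries = sc.parallelize(Seq(Entry("a", "1"), Entry("b", "2")))

  val looked = entries.mapPartitions { it =>
    val client = openClient()                             // one connection per partition
    val out = it.map(e => (e, client.get(e.key))).toList  // materialize before closing
    client.close()
    out.iterator
  }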
On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote:

 +1 on Jason's suggestion.

 bq. this large variable is broadcast many times during the lifetime

 Please consider making this large variable more granular. Meaning, reduce
 the amount of data transferred between the key value store and your app
 during update.

 Cheers

 On Fri, Aug 28, 2015 at 12:44 PM, Jason ja...@jasonknight.us wrote:

 You could try using an external key value store (like HBase, Redis) and
 perform lookups/updates inside of your mappers (you'd need to create the
 connection within a mapPartitions code block to avoid the connection
 setup/teardown overhead)?

 I haven't done this myself though, so I'm just throwing the idea out
 there.

 On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff j...@atware.co.jp wrote:

 Hi,

 I am working on a Spark application that is using a large (~3G)
 broadcast variable as a lookup table. The application refines the data in
 this lookup table in an iterative manner. So this large variable is
 broadcast many times during the lifetime of the application process.

 From what I have observed perhaps 60% of the execution time is spent
 waiting for the variable to broadcast in each iteration. My reading of a
 Spark performance article[1] suggests that the time spent broadcasting will
 increase with the number of nodes I add.

 My question for the group - what would you suggest as an alternative to
 broadcasting a large variable like this?

 One approach I have considered is segmenting my RDD and adding a copy of
 the lookup table for each X number of values to process. So, for example,
 if I have a list of 1 million entries to process (eg, RDD[Entry]), I could
 split this into segments of 100K entries, with a copy of the lookup table,
 and make that an RDD[(Lookup, Array[Entry]).

 Another solution I am looking at is making the lookup table an RDD
 instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to
 improve performance. One issue with this approach is that I would have to
 rewrite my application code to use two RDDs so that I do not reference the
 lookup RDD from within the closure of another RDD.

 Any other recommendations?

 Jeff


 [1]
 http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf

 [2]https://github.com/amplab/spark-indexedrdd





Re: Build k-NN graph for large dataset

2015-08-29 Thread Maruf Aytekin
Yes, you need to use dimensionality reduction and/or locality-sensitive
hashing to reduce the number of pairs to compare. There is also an LSH
implementation for collections of vectors that I have just published here:
https://github.com/marufaytekin/lsh-spark. The implementation is based on this
paper:
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
I hope it will help.

Maruf


On Thu, Aug 27, 2015 at 9:16 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Thank you all for these links. I'll check them.

 On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack charles.t.h...@gmail.com
 wrote:

 +1 to all of the above esp.  Dimensionality reduction and locality
 sensitive hashing / min hashing.

 There's also an algorithm implemented in MLlib called DIMSUM which was
 developed at Twitter for this purpose. I've been meaning to try it and
 would be interested to hear about results you get.

 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum

 ​Charlie


 — Sent from Mailbox

 On Wednesday, Aug 26, 2015 at 09:57, Michael Malak 
 michaelma...@yahoo.com.invalid, wrote:

 Yes. And a paper that describes using grids (actually varying grids) is
 http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
  In
 the Spark GraphX In Action book that Robin East and I are writing, we
 implement a drastically simplified version of this in chapter 7, which
 should become available in the MEAP mid-September.
 http://www.manning.com/books/spark-graphx-in-action


 --

  If you don't want to compute all N^2 similarities, you need to implement
  some kind of blocking first. For example, LSH (locality-sensitive hashing).
  A quick search gave this link to a Spark implementation:


 http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing



 On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

  I'm trying to find an efficient way to build a k-NN graph for a large
  dataset. Precisely, I have a large set of high dimensional vectors (say d
   1) and I want to build a graph where those high dimensional points
  are the vertices and each one is linked to its k nearest neighbors based on
  some kind of similarity defined on the vertex space.
  My problem is to implement an efficient algorithm to compute the weight
  matrix of the graph. I need to compute N*N similarities, and the only way
  I know is to use a cartesian operation followed by a map operation on the RDD.
  But this is very slow when N is large. Is there a cleverer way to
  do this for an arbitrary similarity function?

 Cheers,

 Jao








Re: Spark-on-YARN LOCAL_DIRS location

2015-08-29 Thread Akhil Das
Yes, you can set SPARK_LOCAL_DIRS in spark-env.sh or spark.local.dir in
the spark-defaults.conf file; it would then use this location for the
shuffle writes etc.
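
A minimal sketch of both options (paths are illustrative); note that, per the
warning you quoted, in YARN mode the NodeManager's yarn.nodemanager.local-dirs
is what ultimately wins, so that is the setting to repoint if /tmp keeps
filling up:

  # conf/spark-env.sh on each worker host (standalone / Mesos)
  export SPARK_LOCAL_DIRS=/data/spark-scratch

  # or conf/spark-defaults.conf
  spark.local.dir    /data/spark-scratch

  # yarn-site.xml on each NodeManager (Spark on YARN)
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data1/yarn/local,/data2/yarn/local</value>
  </property>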

Thanks
Best Regards

On Wed, Aug 26, 2015 at 6:56 PM, michael.engl...@nomura.com wrote:

 Hi,



 I am having issues with /tmp space filling up during Spark jobs because
 Spark-on-YARN uses the yarn.nodemanager.local-dirs for shuffle space. I
 noticed this message appears when submitting Spark-on-YARN jobs:



 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden
 by the value set by the cluster manager (via SPARK_LOCAL_DIRS in
 mesos/standalone and LOCAL_DIRS in YARN).



 I can’t find much documentation on where to set the LOCAL_DIRS property.
 Please can someone advise whether this is a yarn-env.sh or a spark-env.sh
 property and whether it would then use the directory specified by this env
 variable as a shuffle area instead of the default
 yarn.nodemanager.local-dirs location?



 Thanks,

 Mike

 This e-mail (including any attachments) is private and confidential, may
 contain proprietary or privileged information and is intended for the named
 recipient(s) only. Unintended recipients are strictly prohibited from
 taking action on the basis of information in this e-mail and must contact
 the sender immediately, delete this e-mail (and all attachments) and
 destroy any hard copies. Nomura will not accept responsibility or liability
 for the accuracy or completeness of, or the presence of any virus or
 disabling code in, this e-mail. If verification is sought please request a
 hard copy. Any reference to the terms of executed transactions should be
 treated as preliminary only and subject to formal written confirmation by
 Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
 communications through its networks (subject to and in accordance with
 applicable laws). No confidentiality or privilege is waived or lost by
 Nomura by any mistransmission of this e-mail. Any reference to Nomura is
 a reference to any entity in the Nomura Holdings, Inc. group. Please read
 our Electronic Communications Legal Notice which forms part of this e-mail:
 http://www.Nomura.com/email_disclaimer.htm



Re: Is there a way to store RDD and load it with its original format?

2015-08-29 Thread Akhil Das
You can do rdd.saveAsObjectFile for storing, and for reading you can do
sc.objectFile.
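
For example, a minimal sketch (the path and record type are illustrative):

  case class Record(id: Int, value: Double)

  val rdd = sc.parallelize(Seq(Record(1, 1.0), Record(2, 2.0)))
  rdd.saveAsObjectFile("/tmp/records-objfile")            // Java-serialized SequenceFile

  val restored = sc.objectFile[Record]("/tmp/records-objfile")
  restored.collect()                                      // Array(Record(1,1.0), Record(2,2.0))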

Thanks
Best Regards

On Thu, Aug 27, 2015 at 9:29 PM, saif.a.ell...@wellsfargo.com wrote:

 Hi,

 Any way to store/load RDDs keeping their original object types instead of strings?

 I am having trouble with Parquet (there is always some error at class
 conversion), and I don’t use Hadoop. Looking for alternatives.

 Thanks in advance
 Saif




Re: Spark MLLIB multiclass calssification

2015-08-29 Thread Zsombor Egyed
Thank you, I saw this before, but it is just a binary classification, so
how can I extend this to multiclass classification?

Simply add different labels?
e.g.:

  new LabeledDocument(0L, "a b c d e spark", 1.0),
  new LabeledDocument(1L, "b d", 0.0),
  new LabeledDocument(2L, "hadoop f g h", 2.0),




On Sun, Aug 30, 2015 at 7:32 AM, Feynman Liang fli...@databricks.com
wrote:

 I would check out the Pipeline code example
 https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

 On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net
 wrote:

 Hi!

 I want to implement a multiclass classification for documents.
  So I have different kinds of text files, and I want to classify them
  with Spark MLlib in Java.

 Do you have any code examples?

 Thanks!

 --


 *Egyed Zsombor *
 Junior Big Data Engineer



 Mobile: +36 70 320 65 81 | Twitter:@starschemaltd

 Email: egye...@starschema.net bali...@starschema.net | Web:
 www.starschema.net





-- 


*Egyed Zsombor *
Junior Big Data Engineer



Mobile: +36 70 320 65 81 | Twitter:@starschemaltd

Email: egye...@starschema.net bali...@starschema.net | Web:
www.starschema.net


Re: How to avoid shuffle errors for a large join ?

2015-08-29 Thread Reynold Xin
Can you try 1.5? This should work much, much better in 1.5 out of the box.

For 1.4, I think you'd want to turn on sort-merge-join, which is off by
default. However, the sort-merge join in 1.4 can still trigger a lot of
garbage, making it slower. SMJ performance is probably 5x - 1000x better in
1.5 for your case.
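
For 1.4, the switch is a SQL conf flag; a minimal sketch, with the flag name as
it appears in the 1.4 SQLConf (worth double-checking against your exact
version):

  // in the application
  sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")

  // or in conf/spark-defaults.conf
  spark.sql.planner.sortMergeJoin  true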


On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak tom...@gmail.com wrote:

 I'm getting errors like "Removing executor with no recent heartbeats" and
 "Missing an output location for shuffle" for a large Spark SQL join
 (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to
 configure the job to avoid them.

 The initial stage completes fine with some 30k tasks on a cluster with 70
 machines/10TB memory, generating about 6.5TB of shuffle writes, but then
 the shuffle stage first waits 30min in the scheduling phase according to
 the UI, and then dies with the mentioned errors.

 I can see in the GC logs that the executors reach their memory limits (32g
 per executor, 2 workers per machine) and can't allocate any more stuff in
 the heap. Fwiw, the top 10 in the memory use histogram are:

 num #instances #bytes  class name
 --
     1: 249139595 11958700560  scala.collection.immutable.HashMap$HashMap1
2: 251085327 8034730464  scala.Tuple2
3: 243694737 5848673688  java.lang.Float
4: 231198778 5548770672  java.lang.Integer
5:  72191585 4298521576  [Lscala.collection.immutable.HashMap;
6:  72191582 2310130624
  scala.collection.immutable.HashMap$HashTrieMap
7:  74114058 1778737392  java.lang.Long
8:   6059103  779203840  [Ljava.lang.Object;
9:   5461096  174755072  scala.collection.mutable.ArrayBuffer
   10: 34749   70122104  [B

 Relevant settings are (Spark 1.4.1, Java 8 with G1 GC):

 spark.core.connection.ack.wait.timeout 600
 spark.executor.heartbeatInterval   60s
 spark.executor.memory  32g
 spark.mesos.coarse false
 spark.network.timeout  600s
 spark.shuffle.blockTransferService netty
 spark.shuffle.consolidateFiles true
 spark.shuffle.file.buffer  1m
 spark.shuffle.io.maxRetries6
 spark.shuffle.manager  sort

 The join is currently configured with spark.sql.shuffle.partitions=1000
 but that doesn't seem to help. Would increasing the partitions help? Is
 there a formula to determine an approximate partition count for a
 join?
 Any help with this job would be appreciated!

 cheers,
 Tom



Spark shell and StackOverFlowError

2015-08-29 Thread ashrowty
I am running the Spark shell (1.2.1) in local mode and I have a simple
RDD[(String,String,Double)] with about 10,000 objects in it. I get a
StackOverflowError each time I try to run the following code (the code
itself is just representative of other logic where I need to pass in a
variable). I tried broadcasting the variable too, but no luck... I am missing
something basic here:

val rdd = sc.makeRDD(List(...))   // data read from file
val a = 10
rdd.map(r => if (a == 10) 1 else 0)

This throws:

java.lang.StackOverflowError
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:318)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
...
...

More experiments... this works:

val lst = Range(0, 1).map(i => ("10", "10", i: Double)).toList
sc.makeRDD(lst).map(i => if (a == 10) 1 else 0)

But the below doesn't, and throws the StackOverflowError:

val lst = MutableList[(String,String,Double)]()
Range(0, 1).foreach(i => lst += (("10", "10", i: Double)))
sc.makeRDD(lst).map(i => if (a == 10) 1 else 0)

Any help appreciated!

Thanks,
Ashish



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-and-StackOverFlowError-tp24508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Setting number of CORES from inside the Topology (JAVA code )

2015-08-29 Thread Akhil Das
When you set .setMaster to local[4], it means that you are allocating 4
threads on your local machine. You can change it to local[1] to run it on a
single thread.

If you are submitting the job to a standalone Spark cluster and you want
to limit the number of cores for your job, then you can do it like:

sparkConf.set("spark.cores.max", "224")

Thanks
Best Regards

On Wed, Aug 26, 2015 at 7:26 PM, anshu shukla anshushuk...@gmail.com
wrote:

 Hey ,

 I need to set the number of cores from inside the topology. It's working
 fine by setting it in spark-env.sh, but I am unable to do it by setting a
 key/value on the conf.

 SparkConf sparkConf = new SparkConf().setAppName("JavaCustomReceiver").setMaster("local[4]");

 if (toponame.equals("IdentityTopology")) {
   sparkConf.setExecutorEnv("SPARK_WORKER_CORES", "1");
 }




 --
 Thanks  Regards,
 Anshu Shukla



Spark MLLIB multiclass calssification

2015-08-29 Thread Zsombor Egyed
Hi!

I want to implement a multiclass classification for documents.
So I have different kinds of text files, and I want to classify them
with Spark MLlib in Java.

Do you have any code examples?

Thanks!

-- 


*Egyed Zsombor *
Junior Big Data Engineer



Mobile: +36 70 320 65 81 | Twitter:@starschemaltd

Email: egye...@starschema.net bali...@starschema.net | Web:
www.starschema.net


Re: How to remove worker node but let it finish first?

2015-08-29 Thread Romi Kuntsman
Is it only available in Mesos?
I'm using a Spark standalone cluster; is there anything like that for it?

On Fri, Aug 28, 2015 at 8:51 AM Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can create a custom mesos framework for your requirement, to get you
 started you can check this out
 http://mesos.apache.org/documentation/latest/app-framework-development-guide/

 Thanks
 Best Regards

 On Mon, Aug 24, 2015 at 12:11 PM, Romi Kuntsman r...@totango.com wrote:

 Hi,
 I have a spark standalone cluster with 100s of applications per day, and
 it changes size (more or less workers) at various hours. The driver runs on
 a separate machine outside the spark cluster.

 When a job is running and its worker is killed (because at that hour the
 number of workers is reduced), it sometimes fails, instead of
 redistributing the work to other workers.

 How is it possible to decommission a worker, so that it doesn't receive
 any new work, but does finish all existing work before shutting down?

 Thanks!





Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Hemminger Jeff
Thanks for the recommendations. I had been focused on solving the problem
within Spark but a distributed database sounds like a better solution.

Jeff

On Sat, Aug 29, 2015 at 11:47 PM, Ted Yu yuzhih...@gmail.com wrote:

 Not sure if the race condition you mentioned is related to Cassandra's
 data consistency model.

 If hbase is used as the external key value store, atomicity is guaranteed.

 Cheers

 On Sat, Aug 29, 2015 at 7:40 AM, Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:

  We are using Cassandra for a similar kind of problem and it works well...
  You need to take care of the race condition between updating the store and
  looking it up...
 On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote:

 +1 on Jason's suggestion.

 bq. this large variable is broadcast many times during the lifetime

 Please consider making this large variable more granular. Meaning,
 reduce the amount of data transferred between the key value store and
 your app during update.

 Cheers

 On Fri, Aug 28, 2015 at 12:44 PM, Jason ja...@jasonknight.us wrote:

 You could try using an external key value store (like HBase, Redis) and
 perform lookups/updates inside of your mappers (you'd need to create the
 connection within a mapPartitions code block to avoid the connection
 setup/teardown overhead)?

 I haven't done this myself though, so I'm just throwing the idea out
 there.

 On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff j...@atware.co.jp
 wrote:

 Hi,

  I am working on a Spark application that is using a large (~3G)
 broadcast variable as a lookup table. The application refines the data in
 this lookup table in an iterative manner. So this large variable is
 broadcast many times during the lifetime of the application process.

 From what I have observed perhaps 60% of the execution time is spent
 waiting for the variable to broadcast in each iteration. My reading of a
 Spark performance article[1] suggests that the time spent broadcasting 
 will
 increase with the number of nodes I add.

 My question for the group - what would you suggest as an alternative
 to broadcasting a large variable like this?

 One approach I have considered is segmenting my RDD and adding a copy
 of the lookup table for each X number of values to process. So, for
 example, if I have a list of 1 million entries to process (eg, 
 RDD[Entry]),
 I could split this into segments of 100K entries, with a copy of the 
 lookup
 table, and make that an RDD[(Lookup, Array[Entry]).

  Another solution I am looking at is making the lookup table an RDD
  instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to
  improve performance. One issue with this approach is that I would have to
  rewrite my application code to use two RDDs so that I do not reference the
  lookup RDD from within the closure of another RDD.

 Any other recommendations?

 Jeff


 [1]
 http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf

 [2]https://github.com/amplab/spark-indexedrdd






Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-08-29 Thread timothy22000
I am doing some memory tuning on my Spark job on YARN and I notice different
settings would give different results and affect the outcome of the Spark
job run. However, I am confused and do not understand completely why it
happens and would appreciate if someone can provide me with some guidance
and explanation. 

I will provide some background information and describe the cases that I
have experienced and post my questions after them below.

*My environment setting were as below:*

 - Memory 20G, 20 VCores per node (3 nodes in total)
 - Hadoop 2.6.0
 - Spark 1.4.0

My code recursively filters an RDD to make it smaller (removing examples as
part of an algorithm), then does mapToPair and collect to gather the results
and save them within a list.

First Case

/bin/spark-submit --class <class name> --master yarn-cluster
--driver-memory 7g --executor-memory 1g --num-executors 3 --executor-cores 1
--jars <jar file>
If I run my program with any driver memory less than 11g, I will get the
error below, which is the SparkContext being stopped, or a similar error in
which a method is called on a stopped SparkContext. From what I have
gathered, this is related to there not being enough memory.


http://apache-spark-user-list.1001560.n3.nabble.com/file/n24507/EKxQD.png 

Second Case

/bin/spark-submit --class <class name> --master yarn-cluster
--driver-memory 7g --executor-memory 3g --num-executors 3 --executor-cores 1
--jars <jar file>

If I run the program with the same driver memory but higher executor memory,
the job runs longer (about 3-4 minutes) than the first case, and then it
encounters a different error from earlier: a container requesting/using more
memory than allowed and being killed because of that. I find this weird,
since the executor memory is increased yet this error occurs instead of the
error in the first case.

http://apache-spark-user-list.1001560.n3.nabble.com/file/n24507/tr24f.png 

Third Case

/bin/spark-submit --class <class name> --master yarn-cluster
--driver-memory 11g --executor-memory 1g --num-executors 3 --executor-cores 1
--jars <jar file>

Any setting with driver memory greater than 10g will lead to the job being
able to run successfully.

Fourth Case

/bin/spark-submit --class <class name> --master yarn-cluster
--driver-memory 2g --executor-memory 1g --conf
spark.yarn.executor.memoryOverhead=1024 --conf
spark.yarn.driver.memoryOverhead=1024 --num-executors 3 --executor-cores 1
--jars <jar file>
The job runs successfully with this setting (driver memory 2g and executor
memory 1g) when increasing the driver memory overhead (1g) and the executor
memory overhead (1g).

Questions


 1. Between the first and second case only the executor memory is increased,
so why is a different error thrown and why does the job run longer (in the
second case)? Are the two errors linked in some way?

 2. Both the third and fourth cases succeed, and I understand that it is
because I am giving more memory, which solves the memory problems. However,
in the third case,

spark.driver.memory + spark.yarn.driver.memoryOverhead = the memory with which
YARN will create a JVM
= 11g + (driverMemory * 0.07, with minimum of 384m)
= 11g + 1.154g
= 12.154g

So, from the formula, I can see that my job requires MEMORY_TOTAL of around
12.154g to run successfully, which explains why I need more than 10g for the
driver memory setting.

But for the fourth case,

spark.driver.memory + spark.yarn.driver.memoryOverhead = the memory with which
YARN will create a JVM
= 2g + (driverMemory * 0.07, with minimum of 384m)
= 2g + 0.524g
= 2.524g

It seems that just by increasing the memory overhead by a small amount of
1024 (1g), the job runs successfully with a driver memory of only 2g and a
MEMORY_TOTAL of only 2.524g! Whereas without the overhead configuration,
driver memory less than 11g fails, which doesn't make sense from the formula
and is why I am confused.

Why does increasing the memory overhead (for both driver and executor) allow my
job to complete successfully with a lower MEMORY_TOTAL (12.154g vs 2.524g)?
Is there some other internal mechanism at work here that I am missing?

I would really appreciate any help offered as it would really improve my
understanding of Spark. Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Effects-of-Driver-Memory-Executor-Memory-Driver-Memory-Overhead-and-Executor-Memory-Overhead-os-tp24507.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-submit issue

2015-08-29 Thread Akhil Das
Did you try putting a sc.stop at the end of your pipeline?

Thanks
Best Regards

On Thu, Aug 27, 2015 at 6:41 PM, pranay pranay.ton...@impetus.co.in wrote:

 I have a Java program that does this (using Spark 1.3.1): it creates a
 command string that uses spark-submit (with my class file etc.) and stores
 this string in a temp file somewhere as a shell script. Using Runtime.exec,
 it executes this script and waits for its completion using process.waitFor.
 Doing ps -ef shows me SparkSubmitDriverBootstrapper and the script running
 my class, in a parent-child relationship.

 The job gets triggered on the Spark cluster and finishes, but
 SparkSubmitDriverBootstrapper still shows up; because of this,
 process.waitFor never returns and I can't detect that execution has ended.

 If I run the /temp file independently, things work fine; only when I
 trigger the /temp script inside Runtime.exec does this issue occur. Any
 comments?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-issue-tp24474.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: commit DB Transaction for each partition

2015-08-29 Thread Akhil Das
What problem are you having? You will have to trigger an action at the end
to execute this piece of code, like:

rdd.mapPartitions(partitionOfRecords => {

DBConnectionInit()

val results = partitionOfRecords.map(..)

DBConnection.commit()

results

}).count()



Thanks
Best Regards

On Thu, Aug 27, 2015 at 7:32 PM, Ahmed Nawar ahmed.na...@gmail.com wrote:

 Dears,

 I need to commit a DB transaction for each partition, not for each row.
 The below didn't work for me.


 rdd.mapPartitions(partitionOfRecords => {

 DBConnectionInit()

 val results = partitionOfRecords.map(..)

 DBConnection.commit()

 results

 })



 Best regards,

 Ahmed Atef Nawwar

 Data Management  Big Data Consultant



Re: How to generate spark assembly (jar file) using Intellij

2015-08-29 Thread Feynman Liang
Have you tried `build/sbt assembly`?

On Sat, Aug 29, 2015 at 9:03 PM, Muler mulugeta.abe...@gmail.com wrote:

 Hi guys,

 I can successfully build Spark using IntelliJ, but I'm not able to
 locate/generate the Spark assembly (jar file) in the assembly/target directory.
 How do I generate one? I have attached a screenshot of my IDE.



 Thanks,


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark MLLIB multiclass calssification

2015-08-29 Thread Feynman Liang
I would check out the Pipeline code example
https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net
wrote:

 Hi!

 I want to implement a multiclass classification for documents.
 So I have different kinds of text files, and I want to classify them
 with Spark MLlib in Java.

 Do you have any code examples?

 Thanks!

 --


 *Egyed Zsombor *
 Junior Big Data Engineer



 Mobile: +36 70 320 65 81 | Twitter:@starschemaltd

 Email: egye...@starschema.net bali...@starschema.net | Web:
 www.starschema.net




Re: Spark MLLIB multiclass calssification

2015-08-29 Thread Feynman Liang
I think the spark.ml logistic regression currently only supports 0/1
labels. If you need multiclass, I would suggest looking at the
spark.ml decision trees. If you don't care too much about pipelines, then you
could use the spark.mllib logistic regression after featurizing.
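
A minimal sketch of the spark.mllib route (the features below are toy
placeholders; in practice they would come from your featurization step):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // three classes, labelled 0.0, 1.0 and 2.0
  val training = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)),
    LabeledPoint(2.0, Vectors.dense(0.0, 0.0, 1.0))
  ))

  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(3)        // number of distinct document classes
    .run(training)

  model.predict(Vectors.dense(0.0, 0.9, 0.1))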

On Sat, Aug 29, 2015 at 10:49 PM, Zsombor Egyed egye...@starschema.net
wrote:

 Thank you, I saw this before, but it is just a binary classification, so
 how can I extend this to multiclass classification?

 Simply add different labels?
 e.g.:

   new LabeledDocument(0L, "a b c d e spark", 1.0),
   new LabeledDocument(1L, "b d", 0.0),
   new LabeledDocument(2L, "hadoop f g h", 2.0),




 On Sun, Aug 30, 2015 at 7:32 AM, Feynman Liang fli...@databricks.com
 wrote:

 I would check out the Pipeline code example
 https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

 On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net
 wrote:

 Hi!

 I want to implement a multiclass classification for documents.
  So I have different kinds of text files, and I want to classify them
  with Spark MLlib in Java.

 Do you have any code examples?

 Thanks!

 --


 *Egyed Zsombor *
 Junior Big Data Engineer



 Mobile: +36 70 320 65 81 | Twitter:@starschemaltd

 Email: egye...@starschema.net bali...@starschema.net | Web:
 www.starschema.net





 --


 *Egyed Zsombor *
 Junior Big Data Engineer



 Mobile: +36 70 320 65 81 | Twitter:@starschemaltd

 Email: egye...@starschema.net bali...@starschema.net | Web:
 www.starschema.net




Re: Invalid environment variable name when submitting job from windows

2015-08-29 Thread Akhil Das
I think you have to use the keyword "set" to set an environment variable in
Windows. Check the section "Setting environment variables" at
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/ntcmds_shelloverview.mspx?mfr=true

Thanks
Best Regards

On Tue, Aug 25, 2015 at 1:25 PM, Yann ROBIN me.s...@gmail.com wrote:

 Hi,

 We have a Spark standalone cluster running on Linux.
 We have a job that we submit to the Spark cluster from Windows. When
 submitting this job from Windows, the execution fails with this error
 in the notes: java.lang.IllegalArgumentException: Invalid environment
 variable name: =::. When submitting from Linux it works fine.

 I thought that this might be the result of one of the env variables on
 my system, so I've modified the submit cmd to remove all env variables
 except the ones needed by Java. This is the env before executing the java
 command:
 ASSEMBLY_DIR=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib

 ASSEMBLY_DIR1=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\../assembly/target/scala-2.10

 ASSEMBLY_DIR2=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\../assembly/target/scala-2.11
 CLASS=org.apache.spark.deploy.SparkSubmit
 CLASSPATH=.;
 JAVA_HOME=C:\Program Files\Java\jre1.8.0_51
 LAUNCHER_OUTPUT=\spark-class-launcher-output-23386.txt

 LAUNCH_CLASSPATH=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar
 PYTHONHASHSEED=0
 RUNNER=C:\Program Files\Java\jre1.8.0_51\bin\java

 SPARK_ASSEMBLY_JAR=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar
 SPARK_CMD=C:\Program Files\Java\jre1.8.0_51\bin\java -cp

 c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\conf\;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\datanucleus-api-jdo-3.2.6.jar;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\datanucleus-core-3.2.10.jar;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\datanucleus-rdbms-3.2.9.jar
 org.apache.spark.deploy.SparkSubmit --master spark://172.16.8.21:7077
 --deploy-mode cluster --conf spark.driver.memory=4G --conf

 spark.driver.extraClassPath=/opt/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
 --class com.publica.Accounts --verbose
 http://server/data-analytics/data-analytics.jar
 spark://172.16.8.21:7077 data-analysis
 http://server/data-analytics/data-analytics.jar 23 8 2015
 SPARK_ENV_LOADED=1
 SPARK_HOME=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..
 SPARK_SCALA_VERSION=2.10
 SystemRoot=C:\Windows
 user_conf_dir=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\..\conf

 _SPARK_ASSEMBLY=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar

 Is there a way to make this works ?

 --
 Yann

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Scala: Overload method by its class type

2015-08-29 Thread Akhil Das
This is more of a Scala-related question; have a look at case classes
in Scala: http://www.scala-lang.org/old/node/107
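
One common approach (not Spark-specific, and named here as an alternative to
what you sketched) is a type class that carries the per-type behaviour, so the
return type follows the class's type parameter; a minimal sketch with
illustrative method bodies:

  trait Behaviour[T] {
    def some_method(args: Int): T
  }

  object Behaviour {
    implicit val intBehaviour: Behaviour[Int] = new Behaviour[Int] {
      def some_method(args: Int): Int = args * 2
    }
    implicit val stringBehaviour: Behaviour[String] = new Behaviour[String] {
      def some_method(args: Int): String = "value-" + args
    }
  }

  class SomeClass[T](implicit b: Behaviour[T]) {
    def some_method(args: Int): T = b.some_method(args)
  }

  val int_instance = new SomeClass[Int]
  val str_instance = new SomeClass[String]
  val result: Boolean  = int_instance.some_method(3) > 0              // Int here
  val result2: Boolean = str_instance.some_method(3) contains "value" // String here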

Thanks
Best Regards

On Tue, Aug 25, 2015 at 6:55 PM, saif.a.ell...@wellsfargo.com wrote:

 Hi all,

 I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE }

 And on runtime, I create instances of this class with different AnyVal +
 String types, but the return type of some_method varies.

 I know I could do this with an implicit object, IF some_method received a
 type, but in this case, I need to have the TYPE defined on its class
 instance, so for example:

 val int_instance = new SomeClass[Int]
 val str_instance = new SomeClass[String]
 val result: Boolean = int_instance.some_method(args) > 0   // I expected Int here
 val result2: Boolean = str_instance.some_method(args) contains “asdfg”   // I expected String here

 without compilation errors.

 Any ideas? I would like to implement something like this:

 class SomeClass[TYPE] {

 def some_method(args: Int): Int = {
 process_integer_overloaded_method
 }

 def some_method(args: Int): String = {
 process_string_overloaded_method
 }

 and so on.

 Any ideas? Maybe store the class's TYPE in the constructor as a variable
 somehow?

 Thanks
 Saif