Re: Efficient Key Structure in pairRDD

2014-11-11 Thread nsareen
Spark devs/users, any help in this regard would be appreciated; we are kind of
stuck at this point.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-Key-Structure-in-pairRDD-tp18461p18557.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




How to change the default limiter for textFile function

2014-11-11 Thread Blind Faith
I am a newbie to Spark, and I program in Python. I use the textFile function to
make an RDD from a file. I notice that the default record delimiter is the newline.
However, I want to change this delimiter to something else. After searching the
web I came across the textinputformat.record.delimiter property, but it doesn't
seem to have any effect when I set it through SparkConf. So my question is: how do
I change the default delimiter in Python?
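
For reference, this property is normally applied through a Hadoop Configuration
passed to newAPIHadoopFile rather than through SparkConf. A minimal sketch of that
approach, shown in Scala and assuming a spark-shell session, a sufficiently recent
Hadoop, and a hypothetical "|" delimiter and path (PySpark exposes a similar
sc.newAPIHadoopFile that accepts a conf dictionary):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "|")   // hypothetical delimiter

val records = sc
  .newAPIHadoopFile("/path/to/file", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }                 // keep only the record text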


Re: Fwd: Executor Lost Failure

2014-11-11 Thread Ritesh Kumar Singh
Yes... found the output on web UI of the slave.

Thanks :)

On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:
  Tasks are now getting submitted, but many tasks don't seem to do anything.
  For example, after opening the spark-shell, I load a text file from disk and try
  printing its contents as:
 
 sc.textFile("/path/to/file").foreach(println)
 
  It does not give me any output.

 That's because foreach launches tasks on the slaves. When each task tries
 to print its lines, they go to the stdout file on the slave rather than to
 your console at the driver. You should see the file's contents in each of
 the slaves' stdout files in the web UI.

 This only happens when running on a cluster. In local mode, all the tasks
 are running locally and can output to the driver, so foreach(println) is
 more useful.

 Ankur
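
A minimal sketch of the distinction described above, assuming a spark-shell
session on a cluster and a hypothetical path:

val lines = sc.textFile("/path/to/file")

lines.foreach(println)           // runs on the executors; output lands in each slave's stdout file
lines.take(10).foreach(println)  // take() ships the first 10 lines back, so this prints at the driver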



Re: disable log4j for spark-shell

2014-11-11 Thread Ritesh Kumar Singh
Go to your Spark home, then into the conf/ directory, and edit the
log4j.properties file, i.e.:

gedit $SPARK_HOME/conf/log4j.properties

and set the root logger to:
   log4j.rootCategory=WARN, console

You don't need to rebuild Spark for the changes to take effect. Whenever you
open spark-shell, it looks into the conf directory by default and loads
all the properties.

Thanks
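
For reference, a minimal conf/log4j.properties along these lines, assuming the
stock template shipped with Spark as the starting point:

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n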

On Tue, Nov 11, 2014 at 6:34 AM, lordjoe lordjoe2...@gmail.com wrote:

 public static void main(String[] args) throws Exception {
     System.out.println("Set Log to Warn");
     Logger rootLogger = Logger.getRootLogger();
     rootLogger.setLevel(Level.WARN);
     ...
  works for me




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18535.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Fwd: disable log4j for spark-shell

2014-11-11 Thread Ritesh Kumar Singh
-- Forwarded message --
From: Ritesh Kumar Singh riteshoneinamill...@gmail.com
Date: Tue, Nov 11, 2014 at 2:18 PM
Subject: Re: disable log4j for spark-shell
To: lordjoe lordjoe2...@gmail.com
Cc: u...@spark.incubator.apache.org


Go to your Spark home, then into the conf/ directory, and edit the
log4j.properties file, i.e.:

gedit $SPARK_HOME/conf/log4j.properties

and set the root logger to:
   log4j.rootCategory=WARN, console

You don't need to rebuild Spark for the changes to take effect. Whenever you
open spark-shell, it looks into the conf directory by default and loads
all the properties.

Thanks

On Tue, Nov 11, 2014 at 6:34 AM, lordjoe lordjoe2...@gmail.com wrote:

 public static void main(String[] args) throws Exception {
     System.out.println("Set Log to Warn");
     Logger rootLogger = Logger.getRootLogger();
     rootLogger.setLevel(Level.WARN);
     ...
  works for me




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18535.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Is there any benchmark suite for spark?

2014-11-11 Thread Hu Liu
I need to do some testing on Spark and am looking for open source tools. Is
there any benchmark suite for Spark that covers common use cases, like Intel's
HiBench (https://github.com/intel-hadoop/HiBench)?

-- 
Best Regards,
Hu Liu


Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

2014-11-11 Thread shahab
Hi,

I have a Spark application which uses the Cassandra connector
(spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar) to load
data from Cassandra into Spark.

Everything works fine in local mode, when I run it in my IDE. But when I
submit the application to be executed on a standalone Spark server, I get the
following exception (which is apparently related to Guava versions?!).
Does anyone know how to solve this?

I create a jar file of my Spark application using assembly.bat, and the
following are the dependencies I used.

I put the connector spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
in the lib/ folder of my Eclipse project; that's why it is not included in
the dependencies:

libraryDependencies ++= Seq(

  "org.apache.spark" %% "spark-catalyst" % "1.1.0" % "provided",

  "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),

  "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),

  "net.jpountz.lz4" % "lz4" % "1.2.0",

  "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-api") exclude("javax.servlet", "servlet-api"),

  "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" intransitive(),

  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),

  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",

  "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",

  "com.github.nscala-time" %% "nscala-time" % "1.0.0",

  "org.scalatest" %% "scalatest" % "1.9.1" % "test",

  "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",

  "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",

  "org.json4s" %% "json4s-jackson" % "3.2.5",

  "junit" % "junit" % "4.8.1" % "test",

  "org.slf4j" % "slf4j-api" % "1.7.7",

  "org.slf4j" % "slf4j-simple" % "1.7.7",

  "org.clapper" %% "grizzled-slf4j" % "1.0.2",

  "log4j" % "log4j" % "1.2.17",

  "com.google.guava" % "guava" % "16.0"
)

best,

/Shahab
And this is the exception I get:

Exception in thread main
com.google.common.util.concurrent.ExecutionError:
java.lang.NoSuchMethodError:
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at
org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:39)
at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org
$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:60)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:123)
at
org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:65)


JdbcRDD and ClassTag issue

2014-11-11 Thread nitinkalra2000
Hi All,

I am trying to access SQL Server through JdbcRDD, but I am getting an error on
the ClassTag placeholder.

Here is the code which I wrote


public void readFromDB() {
    String sql = "Select * from Table_1 where values >= ? and values <= ?";

    class GetJDBCResult extends AbstractFunction1<ResultSet, Integer> {
        public Integer apply(ResultSet rs) {
            Integer result = null;
            try {
                result = rs.getInt(1);
            } catch (SQLException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            return result;
        }
    }

    JdbcRDD<Integer> jdbcRdd = new JdbcRDD<Integer>(sc, jdbcInitialization(),
            sql, 0, 120, 2, new GetJDBCResult(),
            scala.reflect.ClassTag$.MODULE$.apply(Object.class));
}

Can anybody here recommend any solution to this ?

Thanks
Nitin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JdbcRDD-and-ClassTag-issue-tp18570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Where can I find logs from workers PySpark

2014-11-11 Thread jan.zikes
Hi, 

I am trying to do some logging in my PySpark jobs, particularly in the map that is
performed on the workers. Unfortunately I am not able to find these logs. Based on
the documentation it seems that the logs should be on the workers in SPARK_HOME, in
the work directory
http://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging
 . But unfortunately I can't see this directory.

So I would like to ask, how do you do logging on the workers? Is it somehow
possible to send these logs also to the master, for example after the job finishes?

Thank you in advance for any suggestions and advice.

Best regards,
Jan
 




java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Jishnu Menath Prathap (WT01 - BAS)
Hi, I am getting the following error while executing a scala_twitter program for
Spark:
14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with 
error: java.lang.NoSuchMethodError: 
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 
(TID 0)
java.lang.NoSuchMethodError: 
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at 
org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught 
exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.NoSuchMethodError: 
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at 
org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
1
I think it might be a dependency issue, so I am sharing my pom.xml too:

<dependencies>
  <dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-core</artifactId>
    <version>3.0.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.0-beta1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.5</version>
  </dependency>
  <dependency>
    <groupId>oauth.signpost</groupId>
    <artifactId>signpost-commonshttp4</artifactId>
    <version>1.2</version>
  </dependency>
  <dependency>
    <groupId>org.scalatest</groupId>
    <artifactId>scalatest_2.10</artifactId>
    <version>3.0.0-SNAP2</version>
  </dependency>
  <dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.4</version>
  </dependency>
  <dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-stream</artifactId>
    <version>3.0.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>

Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Akhil Das
You don't have the twitter4j jars in the classpath. When running it in
cluster mode you need to ship those dependency jars.

You can do it like:

sparkConf.setJars("/home/akhld/jars/twitter4j-core-3.0.3.jar",
"/home/akhld/jars/twitter4j-stream-3.0.3.jar")

You can make sure they are shipped by checking the Application WebUI (4040)
environment tab.


Thanks
Best Regards
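
For reference, in the Scala API setJars takes a sequence, so the call above would
look roughly as follows (the paths are illustrative):

val sparkConf = new SparkConf()
  .setAppName("TwitterPopularTags")
  .setJars(Seq(
    "/home/akhld/jars/twitter4j-core-3.0.3.jar",
    "/home/akhld/jars/twitter4j-stream-3.0.3.jar"))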

On Tue, Nov 11, 2014 at 5:48 PM, Jishnu Menath Prathap (WT01 - BAS) 
jishnu.prat...@wipro.com wrote:

*Hi I am getting the following error while executing a scala_twitter
 program for spark*
 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor
 with error: java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
 14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage
 0.0 (TID 0)
 java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
 at
 org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler:
 Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
 java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
 at
 org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 1
 *I think it might be a dependency issue so sharing pom.xml too*

 dependencies
 dependency
 groupIdorg.twitter4j/groupId
 artifactIdtwitter4j-core/artifactId
 version3.0.3/version
 /dependency
 dependency
 groupIdorg.apache.httpcomponents/groupId
 artifactIdhttpclient/artifactId
 version4.0-beta1/version
 /dependency
 dependency
 groupIdorg.apache.httpcomponents/groupId
 artifactIdhttpclient/artifactId
 version4.3.5/version
 /dependency
 dependency
 groupIdoauth.signpost/groupId
 artifactIdsignpost-commonshttp4/artifactId
 version1.2/version
 /dependency
 dependency
 groupIdorg.scalatest/groupId
 artifactIdscalatest_2.10/artifactId
 version3.0.0-SNAP2/version
 /dependency
 dependency
 groupIdcommons-io/groupId
 artifactIdcommons-io/artifactId
 version2.4/version
 /dependency
   

save as file

2014-11-11 Thread Naveen Kumar Pokala
Hi,

I am on Spark 1.1.0. I need help with saving an RDD as a JSON file.

How do I do that? And how do I specify an HDFS path in the program?


-Naveen




RE: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Jishnu Menath Prathap (WT01 - BAS)
Hi,
Thank you Akhil for the reply.
I am not using cluster mode; I am running in local mode:
val sparkConf = new
SparkConf().setAppName("TwitterPopularTags").setMaster("local").set("spark.eventLog.enabled","true")
Also, is it documented anywhere which Twitter4j version should be
used with different versions of Spark?

Thanks  Regards
Jishnu Menath Prathap

From: Akhil [via Apache Spark User List] 
[mailto:ml-node+s1001560n18576...@n3.nabble.com]
Sent: Tuesday, November 11, 2014 6:30 PM
To: Jishnu Menath Prathap (WT01 - BAS)
Subject: Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

You are not having the twitter4j jars in the classpath. While running it in the 
cluster mode you need to ship those dependency jars.

You can do like:

sparkConf.setJars("/home/akhld/jars/twitter4j-core-3.0.3.jar","/home/akhld/jars/twitter4j-stream-3.0.3.jar")

You can make sure they are shipped by checking the Application WebUI (4040) 
environment tab.


Thanks
Best Regards

On Tue, Nov 11, 2014 at 5:48 PM, Jishnu Menath Prathap (WT01 - BAS) [hidden 
email]/user/SendEmail.jtp?type=nodenode=18576i=0 wrote:
Hi I am getting the following error while executing a scala_twitter program for 
spark
14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with 
error: java.lang.NoSuchMethodError: 
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 
(TID 0)
java.lang.NoSuchMethodError: 
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at 
org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught 
exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.NoSuchMethodError: 
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at 
org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
1
I think it might be a dependency issue so sharing pom.xml too

dependencies
dependency
groupIdorg.twitter4j/groupId
artifactIdtwitter4j-core/artifactId
version3.0.3/version
/dependency
dependency
groupIdorg.apache.httpcomponents/groupId
artifactIdhttpclient/artifactId
version4.0-beta1/version
/dependency
dependency
groupIdorg.apache.httpcomponents/groupId
artifactIdhttpclient/artifactId
version4.3.5/version
/dependency
dependency
groupIdoauth.signpost/groupId

Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ashic Mahtab
Hi,
I'm trying to submit a Spark application from a network share to the Spark master.
Network shares are configured so that the master and all nodes have access to
the target jar at (say):

\\shares\publish\Spark\app1\someJar.jar


And this is mounted on each linux box (i.e. master and workers) at:


/mnt/spark/app1/someJar.jar


I'm using the following to submit the app from a windows machine:


spark-submit.cmd --class Main --master spark://mastername:7077 
local:/mnt/spark/app1/someJar.jar


However, I get an error saying:


Warning: Local jar \mnt\spark\app1\someJar.jar does not exist, skipping.


Followed by a ClassNotFoundException: Main.


Notice the \'s instead of /s. 


Does anybody have experience getting something similar to work?


Regards,
Ashic.

  

Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Akhil Das
You can pick the dependency version from here
http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.10


Thanks
Best Regards

On Tue, Nov 11, 2014 at 6:36 PM, Jishnu Menath Prathap (WT01 - BAS) 
jishnu.prat...@wipro.com wrote:

  Hi

 Thank you Akhil for reply.

 I am not using cluster mode I am doing in local mode

 val sparkConf = new SparkConf().setAppName(TwitterPopularTags).
 *setMaster(local)*.set(spark.eventLog.enabled,true)

 Also is there anywhere documented which Twitter4j version to
 be used for different versions of spark.



 Thanks  Regards

 Jishnu Menath Prathap



 *From:* Akhil [via Apache Spark User List] [mailto:
 ml-node+s1001560n18576...@n3.nabble.com]
 *Sent:* Tuesday, November 11, 2014 6:30 PM
 *To:* Jishnu Menath Prathap (WT01 - BAS)
 *Subject:* Re: java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener



 You are not having the twitter4j jars in the classpath. While running it
 in the cluster mode you need to ship those dependency jars.



 You can do like:




 sparkConf.setJars(/home/akhld/jars/twitter4j-core-3.0.3.jar,/home/akhld/jars/twitter4j-stream-3.0.3.jar)



 You can make sure they are shipped by checking the Application WebUI
 (4040) environment tab.




   Thanks

 Best Regards



 On Tue, Nov 11, 2014 at 5:48 PM, Jishnu Menath Prathap (WT01 - BAS) [hidden
 email] http://user/SendEmail.jtp?type=nodenode=18576i=0 wrote:

 *Hi I am getting the following error while executing a scala_twitter
 program for spark*
 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor
 with error: java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
 14/11/11 16:39:23 ERROR executor.Executor: Exception in task 0.0 in stage
 0.0 (TID 0)
 java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
 at
 org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 14/11/11 16:39:23 ERROR executor.ExecutorUncaughtExceptionHandler:
 Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
 java.lang.NoSuchMethodError:
 twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
 at
 org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)

 at
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)

 at
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 1
 *I think it might be a dependency issue so sharing pom.xml too*

 dependencies
 dependency
 groupIdorg.twitter4j/groupId
 artifactIdtwitter4j-core/artifactId
 version3.0.3/version
 /dependency
 dependency
 groupIdorg.apache.httpcomponents/groupId
 

Best practice for multi-user web controller in front of Spark

2014-11-11 Thread bethesda
We are relatively new to spark and so far have been manually submitting
single jobs at a time for ML training, during our development process, using
spark-submit.  Each job accepts a small user-submitted data set and compares
it to every data set in our hdfs corpus, which only changes incrementally on
a daily basis.  (that detail is relevant to question 3 below)

Now we are ready to start building out the front-end, which will allow a
team of data scientists to submit their problems to the system via a web
front-end (web tier will be java).  Users could of course be submitting jobs
more or less simultaneously.  We want to make sure we understand how to best
structure this.  

Questions:  

1 - Does a new SparkContext get created in the web tier for each new request
for processing?  

2 - If so, how much time should we expect it to take for setting up the
context?  Our goal is to return a response to the users in under 10 seconds,
but if it takes many seconds to create a new context or otherwise set up the
job, then we need to adjust our expectations for what is possible.  From
using spark-shell one might conclude that it might take more than 10 seconds
to create a context, however it's not clear how much of that is
context-creation vs other things.

3 - (This last question perhaps deserves a post in and of itself:) if every
job is always comparing some little data structure to the same HDFS corpus
of data, what is the best pattern to use to cache the RDD's from HDFS so
they don't have to always be re-constituted from disk?  I.e. how can RDD's
be shared from the context of one job to the context of subsequent jobs? 
Or does something like memcache have to be used?

Thanks!
David
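
On question 3, a minimal sketch of the simplest case: within a single long-running
SparkContext the corpus only needs to be loaded and cached once and can then be
reused by every incoming request, while sharing across separate contexts needs an
external layer such as the job server or Tachyon mentioned in the replies further
down. The path is hypothetical and a spark-shell-style sc is assumed:

import org.apache.spark.storage.StorageLevel

val corpus = sc.textFile("hdfs:///corpus/")        // load once at service start-up
  .persist(StorageLevel.MEMORY_AND_DISK)
corpus.count()                                     // force materialization up front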



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




subscribe

2014-11-11 Thread DAVID SWEARINGEN




How to kill a Spark job running in cluster mode ?

2014-11-11 Thread Tao Xiao
I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode,
which means the driver is not running on local node.

So how can I kill such a job? Is there a command like "hadoop job -kill
job-id", which kills a running MapReduce job?

Thanks


Re: save as file

2014-11-11 Thread Akhil Das
One approach would be to use saveAsNewAPIHadoopFile and specify a JSON
output format.

Another simple one would be like:

import scala.util.parsing.json.JSONObject   // assuming the Scala built-in JSONObject

val rdd = sc.parallelize(1 to 100)
val json = rdd.map(x => {
  val m: Map[String, Int] = Map("id" -> x)
  new JSONObject(m) })

json.saveAsTextFile("output")


Thanks
Best Regards

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:

 Hi,



 I am spark 1.1.0. I need a help regarding saving rdd in a JSON file?



 How to do that? And how to mentions hdfs path in the program.





 -Naveen







Re: save as file

2014-11-11 Thread Ritesh Kumar Singh
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output
to any location specified. The params to be provided are:
path of the storage location
no. of partitions

For giving an HDFS path we use the following format:
/user/user-name/directory-to-store/
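
A minimal sketch combining the two replies in this thread, assuming a spark-shell
session; the JSON strings are hand-built here as a simpler stand-in for the
JSONObject approach shown in the previous reply, and the host name and paths are
illustrative:

val rdd = sc.parallelize(1 to 100)
val json = rdd.map(x => s"""{"id": $x}""")

json.saveAsTextFile("/user/naveen/json-output")                       // resolved against the default filesystem
json.saveAsTextFile("hdfs://namenode:9000/user/naveen/json-output2")  // fully qualified HDFS URI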

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:

 Hi,



 I am spark 1.1.0. I need a help regarding saving rdd in a JSON file?



 How to do that? And how to mentions hdfs path in the program.





 -Naveen







Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann
But that requires an (unnecessary) load from disk.

I have run into this same issue, where we want to save intermediate results
but continue processing. The cache / persist feature of Spark doesn't seem
designed for this case. Unfortunately I'm not aware of a better solution
with the current version of Spark.
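
A minimal sketch of the reload-from-Parquet workaround discussed in this thread,
assuming B and D are SchemaRDDs as in the original post, a sqlContext is available,
and the paths are hypothetical:

B.saveAsParquetFile("hdfs:///tmp/checkpoints/B.parquet")
D.saveAsParquetFile("hdfs:///tmp/checkpoints/D.parquet")

// If F fails later, rebuild the tail of the pipeline from the saved files
// instead of recomputing from A:
val bReloaded = sqlContext.parquetFile("hdfs:///tmp/checkpoints/B.parquet")
val dReloaded = sqlContext.parquetFile("hdfs:///tmp/checkpoints/D.parquet")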

On Mon, Nov 10, 2014 at 5:15 PM, Sean Owen so...@cloudera.com wrote:

 Well you can always create C by loading B from disk, and likewise for
 E / D. No need for any custom procedure.

 On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote:
  When I have a multi-step process flow like this:
 
  A - B - C - D - E - F
 
  I need to store B and D's results into parquet files
 
  B.saveAsParquetFile
  D.saveAsParquetFile
 
  If I don't cache/persist any step, spark might recompute from A,B,C,D
 and E
  if something is wrong in F.
 
  Of course, I'd better cache all steps if I have enough memory to avoid
 this
  re-computation, or persist result to disk. But persisting B and D seems
  duplicate with saving B and D as parquet files.
 
  I'm wondering if spark can restore B and D from the parquet files using a
  customized persist and restore procedure?
 
 
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: How to kill a Spark job running in cluster mode ?

2014-11-11 Thread Ritesh Kumar Singh
There is a property:
   spark.ui.killEnabled
which needs to be set to true for killing applications directly from the web UI.
See the Spark UI section of the configuration docs:
http://spark.apache.org/docs/latest/configuration.html#spark-ui

Thanks
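
For reference, the flag can be set in conf/spark-defaults.conf or directly on the
SparkConf; a minimal sketch with a hypothetical application name:

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.ui.killEnabled", "true")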

On Tue, Nov 11, 2014 at 7:42 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 The web interface has a kill link. You can try using that.

 Best Regards,
 Sonal
 Founder, Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal



 On Tue, Nov 11, 2014 at 7:28 PM, Tao Xiao xiaotao.cs@gmail.com
 wrote:

 I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode,
 which means the driver is not running on local node.

 So how can I kill such a job? Is there a command like hadoop job -kill
 job-id which kills a running MapReduce job ?

 Thanks





Re: Spark + Tableau

2014-11-11 Thread Bojan Kostic
I finally solved the issue with the Spark/Tableau connection.
Thanks to Denny Lee for his blog post:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
The solution was to use the Authentication type "Username", and then use the
username for the metastore.

Best regards
Bojan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p18591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-11-11 Thread Tom Seddon
Hi,

Just wondering if anyone has any advice about this issue, as I am
experiencing the same thing. I'm working with multiple broadcast variables
in PySpark, most of which are small, but one of around 4.5GB, using 10
workers at 31GB memory each and a driver with the same spec. It's not running
out of memory as far as I can see, but it definitely only happens when I add
the large broadcast. I would be most grateful for advice.

I tried playing around with the last 3 conf settings below, but no luck:

SparkConf().set("spark.master.memory", "26")
.set("spark.executor.memory", "26")
.set("spark.worker.memory", "26")
.set("spark.driver.memory", "26")
.set("spark.storage.memoryFraction", "1")
.set("spark.core.connection.ack.wait.timeout", "6000")
.set("spark.akka.frameSize", "50")

Thanks,

Tom


On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote:

 Hi All,

 I am relatively new to spark and currently having trouble with broadcasting
 large variables ~500mb in size. The broadcast fails with an error shown below
 and the memory usage on the hosts also blows up.

 Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb (workers))
 and we are using Spark 1.1 (Python) via Cloudera CDH 5.2.

 We have managed to replicate the error using a test script shown below. I
 would be interested to know if anyone has seen this before with
 broadcasting
 or know of a fix.

 === ERROR ==

 14/10/24 08:20:04 INFO BlockManager: Found block rdd_11_31 locally
 14/10/24 08:20:08 INFO ConnectionManager: Key not valid ?
 sun.nio.ch.SelectionKeyImpl@fbc6caf
 14/10/24 08:20:08 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
 14/10/24 08:20:08 INFO ConnectionManager: key already cancelled ?
 sun.nio.ch.SelectionKeyImpl@fbc6caf
 java.nio.channels.CancelledKeyException
 at
 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at

 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
 14/10/24 08:20:13 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
 14/10/24 08:20:15 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
 14/10/24 08:20:15 INFO ConnectionManager: Key not valid ?
 sun.nio.ch.SelectionKeyImpl@3ecfdb7e
 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
 14/10/24 08:20:15 INFO ConnectionManager: key already cancelled ?
 sun.nio.ch.SelectionKeyImpl@3ecfdb7e
 java.nio.channels.CancelledKeyException
 at
 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at

 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
 14/10/24 08:20:27 ERROR SendingConnection: Exception while reading
 SendingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
 java.nio.channels.ClosedChannelException
 at
 sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
 at
 org.apache.spark.network.SendingConnection.read(Connection.scala:390)
 at

 org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 14/10/24 08:20:27 ERROR SendingConnection: Exception while reading
 SendingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
 java.nio.channels.ClosedChannelException
 at
 sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
   

Re: Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

2014-11-11 Thread Helena Edelson
Hi,
It looks like you are building from master
(spark-cassandra-connector-assembly-1.2.0).
- Append this to your com.google.guava declaration: % "provided"
- Be sure your version of the connector dependency is the same as the assembly
build. For instance, if you are using 1.1.0-beta1, build your assembly with
that rather than master.
- You can upgrade your version of Cassandra, if that is plausible for your
deploy environment, to 2.1.0. Side note: we are releasing 1.1.0-beta2 today or
tomorrow, which allows usage of Cassandra 2.1.1 and fixes any Guava issues.
- Make your version of the Cassandra server and its dependencies match your
Cassandra driver version. You currently have 2.0.9 with 2.0.4.
 

- Helena
@helenaedelson
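
A minimal sketch of the first suggestion applied to the build posted below; only
the Guava line changes, and "provided" keeps this Guava out of the assembly so the
version supplied by Spark/the connector wins:

libraryDependencies += "com.google.guava" % "guava" % "16.0" % "provided"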


On Nov 11, 2014, at 6:13 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,
 
 I  have a spark application which uses Cassandra 
 connectorspark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar to load data 
 from Cassandra into spark.
 
 Everything works fine in the local mode, when I run in my IDE. But when I 
 submit the application to be executed in standalone Spark server, I get the 
 following exception, (which is apparently related to Guava versions ???!). 
 Does any one know how to solve this?
 
 I create a jar file of my spark application using assembly.bat, and the 
 followings is the dependencies I used:
 
 I put the connectorspark-cassandra-connector-assembly-1.2.0-SNAPSHOT.ja in 
 the lib/ folder of my eclipse project thats why it is not included in the 
 dependencies
 libraryDependencies ++= Seq(
 
 org.apache.spark%% spark-catalyst% 1.1.0 % 
 provided,
 
 org.apache.cassandra % cassandra-all % 2.0.9 intransitive(),
 
 org.apache.cassandra % cassandra-thrift % 2.0.9 intransitive(),
 
 net.jpountz.lz4 % lz4 % 1.2.0,
 
 org.apache.thrift % libthrift % 0.9.1 exclude(org.slf4j, 
 slf4j-api) exclude(javax.servlet, servlet-api),
 
 com.datastax.cassandra % cassandra-driver-core % 2.0.4 
 intransitive(),
 
 org.apache.spark %% spark-core % 1.1.0 % provided 
 exclude(org.apache.hadoop, hadoop-core),
 
 org.apache.spark %% spark-streaming % 1.1.0 % provided,
 
 org.apache.hadoop % hadoop-client % 1.0.4 % provided,
 
 com.github.nscala-time %% nscala-time % 1.0.0,
 
 org.scalatest %% scalatest % 1.9.1 % test,
 
 org.apache.spark %% spark-sql % 1.1.0 %  provided,
 
 org.apache.spark %% spark-hive % 1.1.0 % provided,
 
 org.json4s %% json4s-jackson % 3.2.5,
 
 junit % junit % 4.8.1 % test,
 
 org.slf4j % slf4j-api % 1.7.7,
 
 org.slf4j % slf4j-simple % 1.7.7,
 
 org.clapper %% grizzled-slf4j % 1.0.2,
 
 log4j % log4j % 1.2.17,
 
  com.google.guava % guava  % 16.0
 
 
)
 
 
 best,
 
 /Shahab
 And this is the exception I get:
 
 Exception in thread main com.google.common.util.concurrent.ExecutionError: 
 java.lang.NoSuchMethodError: 
 com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
 at 
 com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
 at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at 
 com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
 org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:39)
 at 
 org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:60)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:123)
 at 
 org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:65)



Re: Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ritesh Kumar Singh
Never tried this form, but just guessing:

what's the output when you submit this jar using spark-submit.cmd:
\\shares\publish\Spark\app1\someJar.jar


Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-11-11 Thread Tom Seddon
Yes, please can you share? I am getting this error after expanding my
application to include a large broadcast variable. It would be good to know if
it can be fixed with configuration.

On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com
wrote:

 Can you list what your fix was so others can benefit?

 On Wed, Oct 22, 2014 at 8:15 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi,

 I have managed to resolve it because a wrong setting. Please ignore this .

 Regards
 Arthur

 On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:


 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up






scala.MatchError

2014-11-11 Thread Naveen Kumar Pokala
Hi,

This is my Instrument Java constructor:

public Instrument(Issue issue, Issuer issuer, Issuing issuing) {
    super();
    this.issue = issue;
    this.issuer = issuer;
    this.issuing = issuing;
}


I am trying to create a JavaSchemaRDD:

JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, Instrument.class);

Remarks:


Instrument, Issue, Issuer, Issuing are all Java classes

distData is holding a List<Instrument>


I am getting the following error.



Exception in thread Driver java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: scala.MatchError: class sample.spark.test.Issue (of class 
java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at sample.spark.test.SparkJob.main(SparkJob.java:33)
... 5 more

Please help me.

Regards,
Naveen.


Combining data from two tables in two databases postgresql, JdbcRDD.

2014-11-11 Thread akshayhazari
I want to be able to perform a query over two tables in different databases, and I
want to know whether it can be done. I've heard about the union of two RDDs, but
here I want to connect to something like different partitions of a table.

Any help is appreciated.


import java.io.Serializable;
//import org.junit.*;
//import static org.junit.Assert.*;
import scala.*;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;
import scala.runtime.*;
import scala.collection.mutable.LinkedHashMap;
//import static scala.collection.Map.Projection;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.*;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.StructType;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import java.sql.*;
import java.util.*;
import com.mysql.jdbc.Driver;
import com.mysql.jdbc.*;
import java.io.*;

public class Spark_Mysql {

    static class Z extends AbstractFunction0<java.sql.Connection> implements Serializable {
        java.sql.Connection con;

        public java.sql.Connection apply() {
            try {
                con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/azkaban?user=azkaban&password=password");
            } catch (Exception e) {
                e.printStackTrace();
            }
            return con;
        }
    }

    static public class Z1 extends AbstractFunction1<ResultSet, Integer> implements Serializable {
        int ret;

        public Integer apply(ResultSet i) {
            try {
                ret = i.getInt(1);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return ret;
        }
    }

    public static void main(String[] args) throws Exception {
        String arr[] = new String[1];
        arr[0] = "/home/hduser/Documents/Credentials/Newest_Credentials_AX/spark-1.1.0-bin-hadoop1/lib/mysql-connector-java-5.1.33-bin.jar";

        JavaSparkContext ctx = new JavaSparkContext(new
            SparkConf().setAppName("JavaSparkSQL").setJars(arr));
        SparkContext sctx = new SparkContext(new
            SparkConf().setAppName("JavaSparkSQL").setJars(arr));
        JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(1);
        }

        JdbcRDD rdd = new JdbcRDD(sctx, new Z(),
            "SELECT * FROM spark WHERE ? <= id AND id <= ?",
            0L, 1000L, 10, new Z1(), scala.reflect.ClassTag$.MODULE$.AnyRef());

        rdd.saveAsTextFile("hdfs://127.0.0.1:9000/user/hduser/mysqlrdd");
        rdd.saveAsTextFile("/home/hduser/mysqlrdd");
    }
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Combining-data-from-two-tables-in-two-databases-postgresql-JdbcRDD-tp18597.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Sonal Goyal
I believe the Spark Job Server by Ooyala can help you share data across
multiple jobs, take a look at
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
seems to fit closely to what you need.

Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal



On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote:

 We are relatively new to spark and so far have been manually submitting
 single jobs at a time for ML training, during our development process,
 using
 spark-submit.  Each job accepts a small user-submitted data set and
 compares
 it to every data set in our hdfs corpus, which only changes incrementally
 on
 a daily basis.  (that detail is relevant to question 3 below)

 Now we are ready to start building out the front-end, which will allow a
 team of data scientists to submit their problems to the system via a web
 front-end (web tier will be java).  Users could of course be submitting
 jobs
 more or less simultaneously.  We want to make sure we understand how to
 best
 structure this.

 Questions:

 1 - Does a new SparkContext get created in the web tier for each new
 request
 for processing?

 2 - If so, how much time should we expect it to take for setting up the
 context?  Our goal is to return a response to the users in under 10
 seconds,
 but if it takes many seconds to create a new context or otherwise set up
 the
 job, then we need to adjust our expectations for what is possible.  From
 using spark-shell one might conclude that it might take more than 10
 seconds
 to create a context, however it's not clear how much of that is
 context-creation vs other things.

 3 - (This last question perhaps deserves a post in and of itself:) if every
 job is always comparing some little data structure to the same HDFS corpus
 of data, what is the best pattern to use to cache the RDD's from HDFS so
 they don't have to always be re-constituted from disk?  I.e. how can RDD's
 be shared from the context of one job to the context of subsequent jobs?
 Or does something like memcache have to be used?

 Thanks!
 David



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark and Play

2014-11-11 Thread Akshat Aranya
Hi,

Sorry if this has been asked before; I didn't find a satisfactory answer
when searching. How can I integrate a Play application with Spark? I'm
running into issues with akka-actor versions: Play 2.2.x uses akka-actor
2.0, whereas Play 2.3.x uses akka-actor 2.3.4, and neither works fine
with Spark 1.1.0. Is there something I should do with libraryDependencies
in my build.sbt to make it work?

Thanks,
Akshat


Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Sadhan Sood
Hi Cheng,

I made sure the only hive server running on the machine is
hivethriftserver2.

/usr/lib/jvm/default-java/bin/java -cp
/usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf
-Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master yarn
--jars reporting.jar spark-internal

The query I am running is a simple count(*): "select count(*) from Xyz
where date_prefix=20141031", and I am pretty sure it's submitting a MapReduce
job based on the Spark logs:

TakesRest=false

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=number

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=number

In order to set a constant number of reducers:

  set mapreduce.job.reduces=number

14/11/11 16:23:17 INFO ql.Context: New scratch dir is
hdfs://fdsfdsfsdfsdf:9000/tmp/hive-ubuntu/hive_2014-11-11_16-23-17_333_5669798325805509526-2

Starting Job = job_1414084656759_0142, Tracking URL =
http://xxx:8100/proxy/application_1414084656759_0142/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414084656759_0142

On Mon, Nov 10, 2014 at 9:59 PM, Cheng Lian lian.cs@gmail.com wrote:

  Hey Sadhan,

 I really don't think this is Spark log... Unlike Shark, Spark SQL doesn't
 even provide a Hive mode to let you execute queries against Hive. Would you
 please check whether there is an existing HiveServer2 running there? Spark
 SQL HiveThriftServer2 is just a Spark port of HiveServer2, and they share
 the same default listening port. I guess the Thrift server didn't start
 successfully because the HiveServer2 occupied the port, and your Beeline
 session was probably linked against HiveServer2.

 Cheng


 On 11/11/14 8:29 AM, Sadhan Sood wrote:

 I was testing out the spark thrift jdbc server by running a simple query
 in the beeline client. The spark itself is running on a yarn cluster.

 However, when I run a query in beeline - I see no running jobs in the
 spark UI(completely empty) and the yarn UI seem to indicate that the
 submitted query is being run as a map reduce job. This is probably also
 being indicated from the spark logs but I am not completely sure:

  2014-11-11 00:19:00,492 INFO  ql.Context
 (Context.java:getMRScratchDir(267)) - New scratch dir is
 hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-1

 2014-11-11 00:19:00,877 INFO  ql.Context
 (Context.java:getMRScratchDir(267)) - New scratch dir is
 hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2

 2014-11-11 00:19:04,152 INFO  ql.Context
 (Context.java:getMRScratchDir(267)) - New scratch dir is
 hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2

 2014-11-11 00:19:04,425 INFO  Configuration.deprecation
 (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.submit.replication
 is deprecated. Instead, use mapreduce.client.submit.file.replication

 2014-11-11 00:19:04,516 INFO  client.RMProxy
 (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager
 at :8032

 2014-11-11 00:19:04,607 INFO  client.RMProxy
 (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager
 at :8032

 2014-11-11 00:19:04,639 WARN  mapreduce.JobSubmitter
 (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option
 parsing not performed. Implement the Tool interface and execute your
 application with ToolRunner to remedy this

 2014-11-11 00:00:08,806 INFO  input.FileInputFormat
 (FileInputFormat.java:listStatus(287)) - Total input paths to process :
 14912

 2014-11-11 00:00:08,864 INFO  lzo.GPLNativeCodeLoader
 (GPLNativeCodeLoader.java:clinit(34)) - Loaded native gpl library

 2014-11-11 00:00:08,866 INFO  lzo.LzoCodec (LzoCodec.java:clinit(76)) -
 Successfully loaded  initialized native-lzo library [hadoop-lzo rev
 8e266e052e423af592871e2dfe09d54c03f6a0e8]

 2014-11-11 00:00:09,873 INFO  input.CombineFileInputFormat
 (CombineFileInputFormat.java:createSplits(413)) - DEBUG: Terminated node
 allocation with : CompletedNodes: 1, size left: 194541317

 2014-11-11 00:00:10,017 INFO  mapreduce.JobSubmitter
 (JobSubmitter.java:submitJobInternal(396)) - number of splits:615

 2014-11-11 00:00:10,095 INFO  mapreduce.JobSubmitter
 (JobSubmitter.java:printTokens(479)) - Submitting tokens for job:
 job_1414084656759_0115

 2014-11-11 00:00:10,241 INFO  impl.YarnClientImpl
 (YarnClientImpl.java:submitApplication(167)) 

What should be the number of partitions after a union and a subtractByKey

2014-11-11 Thread Darin McBeath
Assume the following, where updatePairRDD and deletePairRDD are both 
HashPartitioned.  Before the union, each one of these has 512 partitions.   The 
new created updateDeletePairRDD has 1024 partitions.  Is this the 
general/expected behavior for a union (the number of partitions to double)?
JavaPairRDD<String,String> updateDeletePairRDD = 
updatePairRDD.union(deletePairRDD);
Then a similar question for subtractByKey.  In the example below, 
baselinePairRDD is HashPartitioned (with 512 partitions).  We know from above 
that updateDeletePairRDD has 1024 partitions.  The newly created 
workSubtractBaselinePairRDD has 512 partitions.  This makes sense because we 
are only 'subtracting' records from the baselinePairRDD and one wouldn't think 
the number of partitions would increase.  Is this the general/expected behavior 
for a subtractByKey?

JavaPairRDD<String,String> workSubtractBaselinePairRDD = 
baselinePairRDD.subtractByKey(updateDeletePairRDD);
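
For illustration, a minimal sketch of the partition behaviour in question (plain Scala API,
tiny invented data; on the 1.x releases discussed here union simply concatenates its inputs'
partitions, while subtractByKey follows the partitioner of the RDD it is called on):

import org.apache.spark.SparkContext._
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object UnionPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitions-sketch").setMaster("local[4]"))
    val part = new HashPartitioner(512)

    val updates  = sc.parallelize(Seq("a" -> "1", "b" -> "2")).partitionBy(part)
    val deletes  = sc.parallelize(Seq("c" -> "3")).partitionBy(part)
    val baseline = sc.parallelize(Seq("a" -> "0", "d" -> "4")).partitionBy(part)

    // union concatenates the partitions of its inputs: 512 + 512 = 1024
    println(updates.union(deletes).partitions.length)

    // subtractByKey keeps the partitioning of the RDD it is called on: 512
    println(baseline.subtractByKey(updates.union(deletes)).partitions.length)

    // to get back to 512 hash partitions after a union, repartition explicitly
    println(updates.union(deletes).partitionBy(part).partitions.length)

    sc.stop()
  }
}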



Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/
.

-

On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com wrote:

 I believe the Spark Job Server by Ooyala can help you share data across
 multiple jobs, take a look at
 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
 seems to fit closely to what you need.

 Best Regards,
 Sonal
 Founder, Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal



 On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote:

 We are relatively new to spark and so far have been manually submitting
 single jobs at a time for ML training, during our development process,
 using
 spark-submit.  Each job accepts a small user-submitted data set and
 compares
 it to every data set in our hdfs corpus, which only changes incrementally
 on
 a daily basis.  (that detail is relevant to question 3 below)

 Now we are ready to start building out the front-end, which will allow a
 team of data scientists to submit their problems to the system via a web
 front-end (web tier will be java).  Users could of course be submitting
 jobs
 more or less simultaneously.  We want to make sure we understand how to
 best
 structure this.

 Questions:

 1 - Does a new SparkContext get created in the web tier for each new
 request
 for processing?

 2 - If so, how much time should we expect it to take for setting up the
 context?  Our goal is to return a response to the users in under 10
 seconds,
 but if it takes many seconds to create a new context or otherwise set up
 the
 job, then we need to adjust our expectations for what is possible.  From
 using spark-shell one might conclude that it might take more than 10
 seconds
 to create a context, however it's not clear how much of that is
 context-creation vs other things.

 3 - (This last question perhaps deserves a post in and of itself:) if
 every
 job is always comparing some little data structure to the same HDFS corpus
 of data, what is the best pattern to use to cache the RDD's from HDFS so
 they don't have to always be re-constituted from disk?  I.e. how can RDD's
 be shared from the context of one job to the context of subsequent jobs?
 Or does something like memcache have to be used?

 Thanks!
 David



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





data locality, task distribution

2014-11-11 Thread Nathan Kronenfeld
Can anyone point me to a good primer on how spark decides where to send
what task, how it distributes them, and how it determines data locality?

I'm trying a pretty simple task - it's doing a foreach over cached data,
accumulating some (relatively complex) values.

So I see several inconsistencies I don't understand:

(1) If I run it a couple times, as separate applications (i.e., reloading,
recaching, etc), I will get different %'s cached each time.  I've got about
5x as much memory as I need overall, so it isn't running out.  But one
time, 100% of the data will be cached; the next, 83%, the next, 92%, etc.

(2) Also, the data is very unevenly distributed. I've got 400 partitions,
and 4 workers (with, I believe, 3x replication), and on my last run, my
distribution was 165/139/25/71.  Is there any way to get spark to
distribute the tasks more evenly?

(3) If I run the problem several times in the same execution (to take
advantage of caching etc.), I get very inconsistent results.  My latest
try, I get:

   - 1st run: 3.1 min
   - 2nd run: 2 seconds
   - 3rd run: 8 minutes
   - 4th run: 2 seconds
   - 5th run: 2 seconds
   - 6th run: 6.9 minutes
   - 7th run: 2 seconds
   - 8th run: 2 seconds
   - 9th run: 3.9 minutes
   - 10th run: 8 seconds

I understand the difference for the first run; it was caching that time.
Later times, when it manages to work in 2 seconds, it's because all the
tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the
tasks end up with locality level ANY.  Why would that change when running
the exact same task twice in a row on cached data?
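
One setting that directly governs the PROCESS_LOCAL-to-ANY fallback described above is the
scheduler's locality wait. A hedged sketch (not from this thread; the keys are documented
Spark settings, the values purely illustrative and in milliseconds as Spark 1.x expects):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-sketch")
  .set("spark.locality.wait", "10000")          // base wait before dropping a locality level (default 3000)
  .set("spark.locality.wait.process", "10000")  // wait specifically for PROCESS_LOCAL slots
  .set("spark.locality.wait.node", "10000")     // then NODE_LOCAL, before falling back to ANY
val sc = new SparkContext(conf)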

Any help or pointers that I could get would be much appreciated.


Thanks,

 -Nathan



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Xiangrui Meng
Vincenzo sent a PR and included k-means as an example. Sean is helping
review it. PMML standard is quite large. So we may start with simple
model export, like linear methods, then move forward to tree-based.
-Xiangrui

On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
 Hello Spark and MLLib folks,

 So a common problem in the real world of using machine learning is that some
 data analysis use tools like R, but the more data engineers out there will
 use more advanced systems like Spark MLLib or even Python Scikit Learn.

 In the real world, I want to have a system where multiple different
 modeling environments can learn from data / build models, represent the
 models in a common language, and then have a layer which just takes the
 model and run model.predict() all day long -- scores the models in other
 words.

 It looks like the project openscoring.io and jpmml-evaluator are some
 amazing systems for this, but they fundamentally use PMML as the model
 representation here.

 I have read some JIRA tickets that Xiangrui Meng is interested in getting
 PMML implemented to export MLLib models, is that happening? Further, would
 something like Manish Amde's boosted ensemble tree methods be representable
 in PMML?

 Thank you!!
 Aris

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-11-11 Thread Davies Liu
There is a open PR [1] to support broadcast larger than 2G, could you try it?

[1] https://github.com/apache/spark/pull/2659

On Tue, Nov 11, 2014 at 6:39 AM, Tom Seddon mr.tom.sed...@gmail.com wrote:
 Hi,

 Just wondering if anyone has any advice about this issue, as I am
 experiencing the same thing.  I'm working with multiple broadcast variables
 in PySpark, most of which are small, but one of around 4.5GB, using 10
 workers at 31GB memory each and driver with same spec.  It's not running out
 of memory as far as I can see, but definitely only happens when I add the
 large broadcast.  Would be most grateful for advice.

 I tried playing around with the last 3 conf settings below, but no luck:

 SparkConf().set(spark.master.memory, 26)
 .set(spark.executor.memory, 26)
 .set(spark.worker.memory, 26)
 .set(spark.driver.memory, 26).
 .set(spark.storage.memoryFraction,1)
 .set(spark.core.connection.ack.wait.timeout,6000)
 .set(spark.akka.frameSize,50)

There are some invalid configs here: spark.master.memory and
spark.worker.memory are not valid settings.

spark.storage.memoryFraction is too large; with it set to 1 you will have no
memory left for general use (such as shuffle).

The Python jobs run in separate processes, so you should leave some
memory for them on the slaves. For example, if a node has 8 CPUs, each Python
process will need at least 8G (for the 4G broadcast plus object overhead
in Python), so you can run only about 3 processes at the same time; use
spark.cores.max=2 OR spark.task.cpus=4. Also, set
spark.executor.memory to only 10G, leaving 16G for Python.

Also, you could specify spark.python.worker.memory = 8G to have better
shuffle performance in Python. (it's not necessary)

So, for a large broadcast, maybe you should use Scala, which uses
multiple threads per executor; the broadcast will then be shared by
multiple tasks in the same executor.
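
For the Scala route suggested above, a minimal hedged sketch; the data source, sizes and
memory value are illustrative only, not taken from this thread. The point is that a single
broadcast per executor JVM is shared by all of that executor's task threads:

import org.apache.spark.{SparkConf, SparkContext}

object LargeBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("large-broadcast-sketch")
      .set("spark.executor.memory", "26g")  // executor heap; driver memory is set via spark-submit before the JVM starts
    val sc = new SparkContext(conf)

    val bigTable: Map[String, Int] = Map("example" -> 1)  // placeholder for the ~4.5GB structure
    val bc = sc.broadcast(bigTable)                       // shipped once per executor, shared by its tasks

    val hits = sc.textFile("hdfs:///input").filter(line => bc.value.contains(line)).count()
    println(hits)
    sc.stop()
  }
}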

 Thanks,

 Tom


 On 24 October 2014 12:31, htailor hemant.tai...@live.co.uk wrote:

 Hi All,

 I am relatively new to spark and currently having troubles with
 broadcasting
 large variables ~500mb in size. The
 broadcast fails with an error shown below and the memory usage on the
 hosts also blow up.

 Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb
 (workers))
 and we are using Spark 1.1 (Python) via Cloudera CDH 5.2.

 We have managed to replicate the error using a test script shown below. I
 would be interested to know if anyone has seen this before with
 broadcasting
 or know of a fix.

 === ERROR ==

 14/10/24 08:20:04 INFO BlockManager: Found block rdd_11_31 locally
 14/10/24 08:20:08 INFO ConnectionManager: Key not valid ?
 sun.nio.ch.SelectionKeyImpl@fbc6caf
 14/10/24 08:20:08 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
 14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
 14/10/24 08:20:08 INFO ConnectionManager: key already cancelled ?
 sun.nio.ch.SelectionKeyImpl@fbc6caf
 java.nio.channels.CancelledKeyException
 at

 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at

 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
 14/10/24 08:20:13 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
 14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
 14/10/24 08:20:15 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
 14/10/24 08:20:15 INFO ConnectionManager: Key not valid ?
 sun.nio.ch.SelectionKeyImpl@3ecfdb7e
 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
 14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
 14/10/24 08:20:15 INFO ConnectionManager: key already cancelled ?
 sun.nio.ch.SelectionKeyImpl@3ecfdb7e
 java.nio.channels.CancelledKeyException
 at

 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at

 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
 14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to
 ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
 14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to
 ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
 14/10/24 

Re: MLLib Decision Tress algorithm hangs, others fine

2014-11-11 Thread Xiangrui Meng
Could you provide more information? For example, spark version,
dataset size (number of instances/number of features), cluster size,
error messages from both the drive and the executor. -Xiangrui

On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote:
 Hello all,

 I have some text data that I am running different algorithms on.
 I had no problems with LibSVM and Naive Bayes on the same data,
 but when I run Decision Tree, the execution hangs in the middle
 of DecisionTree.trainClassifier(). The only difference from the example
 given on the site is that I am using 6 categories instead of 2, and the
 input is text that is transformed to labeled points using TF-IDF. It
 halts shortly after this log output:

 spark.SparkContext: Job finished: collect at DecisionTree.scala:1347, took
 1.019579676 s

 Any ideas as to what could be causing this?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tress-algorithm-hangs-others-fine-tp18515.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: scala.MatchError

2014-11-11 Thread Xiangrui Meng
I think you need a Java bean class instead of a normal class. See
example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
(switch to the java tab). -Xiangrui

On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
 Hi,



 This is my Instrument java constructor.



 public Instrument(Issue issue, Issuer issuer, Issuing issuing) {

 super();

 this.issue = issue;

 this.issuer = issuer;

 this.issuing = issuing;

 }





 I am trying to create javaschemaRDD



 JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData,
 Instrument.class);



 Remarks:

 



 Instrument, Issue, Issuer, Issuing all are java classes



 distData is holding List<Instrument>





 I am getting the following error.







 Exception in thread Driver java.lang.reflect.InvocationTargetException

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:483)

 at
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

 Caused by: scala.MatchError: class sample.spark.test.Issue (of class
 java.lang.Class)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)

 at sample.spark.test.SparkJob.main(SparkJob.java:33)

 ... 5 more



 Please help me.



 Regards,

 Naveen.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: scala.MatchError

2014-11-11 Thread Michael Armbrust
Xiangrui is correct that is must be a java bean, also nested classes are
not yet supported in java.

On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote:

 I think you need a Java bean class instead of a normal class. See
 example here:
 http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
 (switch to the java tab). -Xiangrui

 On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
 npok...@spcapitaliq.com wrote:
  Hi,
 
 
 
  This is my Instrument java constructor.
 
 
 
  public Instrument(Issue issue, Issuer issuer, Issuing issuing) {
 
  super();
 
  this.issue = issue;
 
  this.issuer = issuer;
 
  this.issuing = issuing;
 
  }
 
 
 
 
 
  I am trying to create javaschemaRDD
 
 
 
  JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData,
  Instrument.class);
 
 
 
  Remarks:
 
  
 
 
 
  Instrument, Issue, Issuer, Issuing all are java classes
 
 
 
  distData is holding List<Instrument>
 
 
 
 
 
  I am getting the following error.
 
 
 
 
 
 
 
  Exception in thread Driver java.lang.reflect.InvocationTargetException
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 
  at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 
  at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
  at java.lang.reflect.Method.invoke(Method.java:483)
 
  at
 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 
  Caused by: scala.MatchError: class sample.spark.test.Issue (of class
  java.lang.Class)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
 
  at
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
  at
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
  at
 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 
  at
  scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 
  at
  scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 
  at
 scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
 
  at sample.spark.test.SparkJob.main(SparkJob.java:33)
 
  ... 5 more
 
 
 
  Please help me.
 
 
 
  Regards,
 
  Naveen.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Still struggling with building documentation

2014-11-11 Thread Alessandro Baretta
Nichols and Patrick,

Thanks for your help, but, no, it still does not work. The latest master
produces the following scaladoc errors:

[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55:
not found: type Type
[error]   protected Type type() { return Type.UPLOAD_BLOCK; }
[error] ^
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39:
not found: type Type
[error]   protected Type type() { return Type.STREAM_HANDLE; }
[error] ^
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40:
not found: type Type
[error]   protected Type type() { return Type.OPEN_BLOCKS; }
[error] ^
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44:
not found: type Type
[error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
[error] ^

...

[error] four errors found
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] (spark/scalaunidoc:doc) Scaladoc generation failed
[error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM
Moving back into docs dir.
Making directory api/scala
cp -r ../target/scala-2.10/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Moving to python/docs directory and building sphinx.
Makefile:14: *** The 'sphinx-build' command was not found. Make sure you
have Sphinx installed, then set the SPHINXBUILD environment variable to
point to the full path of the 'sphinx-build' executable. Alternatively you
can add the directory with the executable to your PATH. If you don't have
Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.

Moving back into home dir.
Making directory api/python
cp -r python/docs/_build/html/. docs/api/python
/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory
- python/docs/_build/html/. (Errno::ENOENT)
from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top
(required)'
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
from /usr/bin/jekyll:224:in `new'
from /usr/bin/jekyll:224:in `main'

What next?

Alex




On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I believe the web docs need to be built separately according to the
 instructions here
 https://github.com/apache/spark/blob/master/docs/README.md.

 Did you give those a shot?

 It's annoying to have a separate thing with new dependencies in order to
 build the web docs, but that's how it is at the moment.

 Nick

 On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com
 wrote:

 I finally came to realize that there is a special maven target to build
  the scaladocs, although arguably a very unintuitive one: mvn verify. So now
 I have scaladocs for each package, but not for the whole spark project.
 Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
 build/docs/api directory referenced in api.html is missing. How do I build
 it?

 Alex Baretta





Re: How to execute a function from class in distributed jar on each worker node?

2014-11-11 Thread aaronjosephs
I'm not sure that this will work but it makes sense to me. Basically you
write the functionality in a static block in a class and broadcast that
class. Not sure what your use case is but I need to load a native library
and want to avoid running the init in mapPartitions if it's not necessary
(just to make the code look cleaner)
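
A hedged sketch of that idea in Scala: a singleton object's body behaves like a static
initializer and runs once per executor JVM, the first time a task touches it. The library
name here is made up purely for illustration.

object NativeLib {
  // Runs once per JVM (i.e. once per executor), not once per task or per partition
  System.loadLibrary("mynative")
  def process(s: String): String = s  // placeholder for the actual native call
}

// A task can then simply reference the object, with no explicit init in mapPartitions:
//   rdd.map(x => NativeLib.process(x))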



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-execute-a-function-from-class-in-distributed-jar-on-each-worker-node-tp3870p18611.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



S3 table to spark sql

2014-11-11 Thread Franco Barrientos
How can I create a date field in Spark SQL? I have an S3 table and I load it
into an RDD.

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.createSchemaRDD

 

case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int,
sku: String, unidades: Double, monto: Double)

 

val tabla =
sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p =>
trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString,
p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
p(6).trim.toDouble))

tabla.registerTempTable("trx_u3m")

 

Now my problem is: how can I transform the string variable (fechau3m) into a
date variable?

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

 mailto:franco.barrien...@exalitica.com franco.barrien...@exalitica.com 

 http://www.exalitica.com/ www.exalitica.com



 



pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Hello there,

I am wondering how to get the column family names and column qualifier names
when using pyspark to read an hbase table with multiple column families.

I have a hbase table as follows,
hbase(main):007:0> scan 'data1'
ROW   COLUMN+CELL   
 row1 column=f1:, timestamp=1411078148186, value=value1 
 row1 column=f2:, timestamp=1415732470877, value=value7 
 row2 column=f2:, timestamp=1411078160265, value=value2 

when I ran the examples/hbase_inputformat.py code: 
conf2 = {"hbase.zookeeper.quorum": "localhost",
    "hbase.mapreduce.inputtable": 'data1'}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf2)
output = hbase_rdd.collect()
for (k, v) in output:
print (k, v)
I only see 
(u'row1', u'value1')
(u'row2', u'value2')

What I really want is (row_id, column family:column qualifier, value)
tuples. Any comments? Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
JPMML evaluator just changed their license to AGPL or commercial
license, and I think AGPL is not compatible with apache project. Any
advice?

https://github.com/jpmml/jpmml-evaluator

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote:
 Vincenzo sent a PR and included k-means as an example. Sean is helping
 review it. PMML standard is quite large. So we may start with simple
 model export, like linear methods, then move forward to tree-based.
 -Xiangrui

 On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
 Hello Spark and MLLib folks,

 So a common problem in the real world of using machine learning is that some
 data analysis use tools like R, but the more data engineers out there will
 use more advanced systems like Spark MLLib or even Python Scikit Learn.

 In the real world, I want to have a system where multiple different
 modeling environments can learn from data / build models, represent the
 models in a common language, and then have a layer which just takes the
 model and run model.predict() all day long -- scores the models in other
 words.

 It looks like the project openscoring.io and jpmml-evaluator are some
 amazing systems for this, but they fundamentally use PMML as the model
 representation here.

 I have read some JIRA tickets that Xiangrui Meng is interested in getting
 PMML implemented to export MLLib models, is that happening? Further, would
 something like Manish Amde's boosted ensemble tree methods be representable
 in PMML?

 Thank you!!
 Aris

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Hi,

Is there a way to extract only the English language tweets when using
TwitterUtils.createStream()? The filters argument specifies the strings
that need to be contained in the tweets, but I am not sure how this can be
used to specify the language.

thanks





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Sean Owen
Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're
3-clause BSD:

https://github.com/jpmml/jpmml-model

So some of the scoring components are off-limits for an AL2 project but the
core model components are OK.

On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com wrote:

 JPMML evaluator just changed their license to AGPL or commercial
 license, and I think AGPL is not compatible with apache project. Any
 advice?

 https://github.com/jpmml/jpmml-evaluator

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote:
  Vincenzo sent a PR and included k-means as an example. Sean is helping
  review it. PMML standard is quite large. So we may start with simple
  model export, like linear methods, then move forward to tree-based.
  -Xiangrui
 
  On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
  Hello Spark and MLLib folks,
 
  So a common problem in the real world of using machine learning is that
 some
  data analysis use tools like R, but the more data engineers out there
 will
  use more advanced systems like Spark MLLib or even Python Scikit Learn.
 
  In the real world, I want to have a system where multiple different
  modeling environments can learn from data / build models, represent the
  models in a common language, and then have a layer which just takes the
  model and run model.predict() all day long -- scores the models in other
  words.
 
  It looks like the project openscoring.io and jpmml-evaluator are some
  amazing systems for this, but they fundamentally use PMML as the model
  representation here.
 
  I have read some JIRA tickets that Xiangrui Meng is interested in
 getting
  PMML implemented to export MLLib models, is that happening? Further,
 would
  something like Manish Amde's boosted ensemble tree methods be
 representable
  in PMML?
 
  Thank you!!
  Aris
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tathagata Das
You could get all the tweets in the stream, and then apply filter
transformation on the DStream of tweets to filter away non-english
tweets. The tweets in the DStream is of type twitter4j.Status which
has a field describing the language. You can use that in the filter.

Though in practice, a lot of non-english tweets are also marked as
english by Twitter. To really filter out ALL non-english tweets, you
will have to probably do some machine learning stuff to identify
English tweets.
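
A hedged sketch of such a filter. Whether Status.getLang() is available depends on the
twitter4j version bundled with spark-streaming-twitter, so this version falls back to the
user's declared language, which twitter4j 3.x does expose:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object EnglishTweetsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("english-tweets"), Seconds(10))
    val tweets = TwitterUtils.createStream(ssc, None)  // DStream of twitter4j.Status

    // Keep statuses whose (user-declared) language is English; imperfect, as noted above
    val english = tweets.filter(status => Option(status.getUser).exists(_.getLang == "en"))
    english.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}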

On Tue, Nov 11, 2014 at 11:41 AM, SK skrishna...@gmail.com wrote:
 Hi,

 Is there a way to extract only the English language tweets when using
 TwitterUtils.createStream()? The filters argument specifies the strings
 that need to be contained in the tweets, but I am not sure how this can be
 used to specify the language.

 thanks





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Failed jobs showing as SUCCEEDED on web UI

2014-11-11 Thread Brett Meyer
I'm running a Python script using spark-submit on YARN in an EMR cluster,
and if I have a job that fails due to ExecutorLostFailure or if I kill the
job, it still shows up on the web UI with a FinalStatus of SUCCEEDED.  Is
this due to PySpark, or is there potentially some other issue with the job
failure status not propagating to the logs?






Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
checked the source, found the following,

class HBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
val result = obj.asInstanceOf[Result]
Bytes.toStringBinary(result.value())
  }
}

I feel using 'result.value()' here is a big limitation. Converting from the
'list()' from the 'Result' is more general and easy to use. 
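
For reference, a hedged sketch of what a more general converter could look like. It is not
the converter shipped with Spark, and it assumes an HBase client that provides
Result.rawCells() and CellUtil (0.96+); on 0.94 the equivalent walk would be over
result.list(). It would still need to be compiled, put on the classpath, and passed as the
valueConverter:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

class HBaseResultToFamilyQualifierValueConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    // Emit every cell as "family:qualifier=value" instead of only result.value()
    result.rawCells().map { cell =>
      val family    = Bytes.toStringBinary(CellUtil.cloneFamily(cell))
      val qualifier = Bytes.toStringBinary(CellUtil.cloneQualifier(cell))
      val value     = Bytes.toStringBinary(CellUtil.cloneValue(cell))
      s"$family:$qualifier=$value"
    }.mkString(",")
  }
}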



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18619.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



failed to create a table with python (single node)

2014-11-11 Thread Pagliari, Roberto
I'm executing this example from the documentation (in single node mode)

# sc is an existing SparkContext.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

# Queries can be expressed in HiveQL.
results = sqlContext.sql("FROM src SELECT key, value").collect()




1.   would it be possible to get the results from collect in a more 
human-readable format? For example, I would like to  have a result similar to 
what I would get using hive CLI.

2.   The first query does not seem to create the table. I tried show 
tables; from hive after doing it, and the table src did not show up.


Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Thanks for the response. I tried the following :

   tweets.filter(_.getLang()="en")

I get a compilation error:
   value getLang is not a member of twitter4j.Status

But getLang() is one of the methods of twitter4j.Status since version 3.0.6
according to the doc at:
   http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--

What version of twitter4j does Spark Streaming use?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18621.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK

Small typo in my code in  the previous post. That should be: 
 tweets.filter(_.getLang()=="en") 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



groupBy for DStream

2014-11-11 Thread SK

Hi.

1) I don't see a groupBy() method for a DStream object. Not sure why that is
not supported. Currently I am using filter () to separate out the different
groups. I would like to know if there is a way to convert a DStream object
to a regular RDD so that I can apply the RDD methods like groupBy.


2) The count() method for a DStream object returns a DStream[Long] instead
of a simple Long (like RDD does). How can I extract the simple Long count
value? I tried dstream(0) but got a compilation error that it does not take
parameters. I also tried dstream[0], but that also resulted in a compilation
error. I am not able to use the head() or take(0) method for DStream either.

thanks
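
A hedged sketch that addresses both points: transform() applies any RDD method, including
groupBy/groupByKey, to each batch, and foreachRDD() hands you the batch's RDD so count()
returns a plain Long. The socket source and the field split are illustrative only:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamGroupBySketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("dstream-groupby-sketch"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map(line => (line.split(" ")(0), 1))

    // 1) transform() exposes the underlying RDD of each batch, so any RDD method works
    val groupedPerBatch = pairs.transform(rdd => rdd.groupByKey())
    groupedPerBatch.print()

    // 2) foreachRDD() also exposes the RDD, so count() here is a plain Long per batch
    pairs.foreachRDD { rdd =>
      val n: Long = rdd.count()
      println("records in this batch: " + n)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}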



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-for-DStream-tp18623.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: closure serialization behavior driving me crazy

2014-11-11 Thread Sandy Ryza
I tried turning on the extended debug info.  The Scala output is a little
opaque (lots of - field (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name:
$iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC), but it seems
like, as expected, somehow the full array of OLSMultipleLinearRegression
objects is getting pulled in.

I'm not sure I understand your comment about Array.ofDim being large.  When
serializing the array alone, it only takes up about 80K, which is close to
1867*5*sizeof(double).  The 400MB comes when referencing the array from a
function, which pulls in all the extra data.

Copying the global variable into a local one seems to work.  Much
appreciated, Matei.
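
For reference, a hedged shell-style sketch of that pattern; the array shape is taken from the
thread, the computation is invented purely to show where the local copy goes (assumes the
spark-shell's sc):

// in spark-shell
val factorWeights: Array[Array[Double]] = Array.ofDim[Double](1867, 5)

val sums = {
  val localWeights = factorWeights  // local copy: the closure captures only the array,
                                    // not the interpreter wrapper that also holds `models`
  sc.parallelize(0 until factorWeights.length)
    .map(i => localWeights(i).sum)
    .collect()
}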


On Mon, Nov 10, 2014 at 9:26 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey Sandy,

 Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the
 JVM to print the contents of the objects. In addition, something else that
 helps is to do the following:

 {
   val  _arr = arr
   models.map(... _arr ...)
 }

 Basically, copy the global variable into a local one. Then the field
 access from outside (from the interpreter-generated object that contains
 the line initializing arr) is no longer required, and the closure no longer
 has a reference to that.

 I'm really confused as to why Array.ofDim would be so large by the way,
 but are you sure you haven't flipped around the dimensions (e.g. it should
 be 5 x 1800)? A 5-double array will consume more than 5*8 bytes (probably
 something like 60 at least), and an array of those will still have a
 pointer to each one, so I'd expect that many of them to be more than 80 MB
 (which is very close to 1867*5*8).

 Matei

  On Nov 10, 2014, at 1:01 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
  I'm experiencing some strange behavior with closure serialization that
 is totally mind-boggling to me.  It appears that two arrays of equal size
 take up vastly different amount of space inside closures if they're
 generated in different ways.
 
  The basic flow of my app is to run a bunch of tiny regressions using
 Commons Math's OLSMultipleLinearRegression and then reference a 2D array of
 the results from a transformation.  I was running into OOME's and
 NotSerializableExceptions and tried to get closer to the root issue by
 calling the closure serializer directly.
scala val arr = models.map(_.estimateRegressionParameters()).toArray
 
  The result array is 1867 x 5. It serialized is 80k bytes, which seems
 about right:
scala SparkEnv.get.closureSerializer.newInstance().serialize(arr)
res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
 cap=80027]
 
  If I reference it from a simple function:
scala def func(x: Long) = arr.length
scala SparkEnv.get.closureSerializer.newInstance().serialize(func)
  I get a NotSerializableException.
 
  If I take pains to create the array using a loop:
scala val arr = Array.ofDim[Double](1867, 5)
scala for (s <- 0 until models.length) {
| factorWeights(s) = models(s).estimateRegressionParameters()
| }
  Serialization works, but the serialized closure for the function is a
 whopping 400MB.
 
  If I pass in an array of the same length that was created in a different
 way, the size of the serialized closure is only about 90K, which seems
 about right.
 
  Naively, it seems like somehow the history of how the array was created
 is having an effect on what happens to it inside a closure.
 
  Is this expected behavior?  Can anybody explain what's going on?
 
  any insight very appreciated,
  Sandy




Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up
before the release:

https://issues.apache.org/jira/browse/SPARK-4326

On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta
alexbare...@gmail.com wrote:
 Nichols and Patrick,

 Thanks for your help, but, no, it still does not work. The latest master
 produces the following scaladoc errors:

 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55:
 not found: type Type
 [error]   protected Type type() { return Type.UPLOAD_BLOCK; }
 [error] ^
 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39:
 not found: type Type
 [error]   protected Type type() { return Type.STREAM_HANDLE; }
 [error] ^
 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40:
 not found: type Type
 [error]   protected Type type() { return Type.OPEN_BLOCKS; }
 [error] ^
 [error]
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44:
 not found: type Type
 [error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
 [error] ^

 ...

 [error] four errors found
 [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
 [error] (spark/scalaunidoc:doc) Scaladoc generation failed
 [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM
 Moving back into docs dir.
 Making directory api/scala
 cp -r ../target/scala-2.10/unidoc/. api/scala
 Making directory api/java
 cp -r ../target/javaunidoc/. api/java
 Moving to python/docs directory and building sphinx.
 Makefile:14: *** The 'sphinx-build' command was not found. Make sure you
 have Sphinx installed, then set the SPHINXBUILD environment variable to
 point to the full path of the 'sphinx-build' executable. Alternatively you
 can add the directory with the executable to your PATH. If you don't have
 Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.

 Moving back into home dir.
 Making directory api/python
 cp -r python/docs/_build/html/. docs/api/python
 /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory -
 python/docs/_build/html/. (Errno::ENOENT)
 from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
 from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
 from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
 from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
 from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top
 (required)'
 from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
 from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
 from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
 from /usr/bin/jekyll:224:in `new'
 from /usr/bin/jekyll:224:in `main'

 What next?

 Alex




 On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 I believe the web docs need to be built separately according to the
 instructions here.

 Did you give those a shot?

 It's annoying to have a separate thing with new dependencies in order to
 build the web docs, but that's how it is at the moment.

 Nick

 On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com
 wrote:

 I finally came to realize that there is a special maven target to build
  the scaladocs, although arguably a very unintuitive one: mvn verify. So now I
 have scaladocs for each package, but not for the whole spark project.
 Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
 build/docs/api directory referenced in api.html is missing. How do I build
 it?

 Alex Baretta




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There,

Because Akka versions are not binary compatible with one another, it
might not be possible to integrate Play with Spark 1.1.0.

- Patrick

On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote:
 Hi,

 Sorry if this has been asked before; I didn't find a satisfactory answer
 when searching.  How can I integrate a Play application with Spark?  I'm
 getting into issues of akka-actor versions.  Play 2.2.x uses akka-actor 2.0,
 whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine with
 Spark 1.1.0.  Is there something I should do with libraryDependencies in my
 build.sbt to make it work?

 Thanks,
 Akshat

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
I also worry about that the author of JPMML changed the license of
jpmml-evaluator due to his interest of his commercial business, and he
might change the license of jpmml-model in the future.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Nov 11, 2014 at 11:43 AM, Sean Owen so...@cloudera.com wrote:
 Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're
 3-clause BSD:

 https://github.com/jpmml/jpmml-model

 So some of the scoring components are off-limits for an AL2 project but the
 core model components are OK.

 On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com wrote:

 JPMML evaluator just changed their license to AGPL or commercial
 license, and I think AGPL is not compatible with apache project. Any
 advice?

 https://github.com/jpmml/jpmml-evaluator

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com wrote:
  Vincenzo sent a PR and included k-means as an example. Sean is helping
  review it. PMML standard is quite large. So we may start with simple
  model export, like linear methods, then move forward to tree-based.
  -Xiangrui
 
  On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
  Hello Spark and MLLib folks,
 
  So a common problem in the real world of using machine learning is that
  some
  data analysis use tools like R, but the more data engineers out there
  will
  use more advanced systems like Spark MLLib or even Python Scikit Learn.
 
  In the real world, I want to have a system where multiple different
  modeling environments can learn from data / build models, represent the
  models in a common language, and then have a layer which just takes the
  model and run model.predict() all day long -- scores the models in
  other
  words.
 
  It looks like the project openscoring.io and jpmml-evaluator are some
  amazing systems for this, but they fundamentally use PMML as the model
  representation here.
 
  I have read some JIRA tickets that Xiangrui Meng is interested in
  getting
  PMML implemented to export MLLib models, is that happening? Further,
  would
  something like Manish Amde's boosted ensemble tree methods be
  representable
  in PMML?
 
  Thank you!!
  Aris
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-11 Thread Harry Brundage
We've had success with Azkaban from LinkedIn over Oozie and Luigi.
http://azkaban.github.io/

Azkaban has support for many different job types, a fleshed out web UI with
decent log reporting, a decent failure / retry model, a REST api, and I
think support for multiple executor slaves is coming in the future. We
found Oozie's configuration and execution cumbersome, and Luigi immature.

On Tue, Nov 11, 2014 at 2:50 AM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi again,

 As Jimmy said, any thoughts about Luigi and/or any other tools? So far it
 seems that Oozie is the best and only choice here. Is that right?

 On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com
 wrote:

 I have used Oozie for all our workflows with Spark apps but you will have
  to use a java event as the workflow element.   I am interested in anyone's
 experience with Luigi and/or any other tools.


 On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 I have some previous experience with Apache Oozie while I was developing
 in Apache Pig. Now, I am working explicitly with Apache Spark and I am
 looking for a tool with similar functionality. Is Oozie recommended? What
 about Luigi? What do you use \ recommend?




 --


 Nothing under the sun is greater than education. By educating one person
 and sending him/her into the society of his/her generation, we make a
 contribution extending a hundred generations to come.
 -Jigoro Kano, Founder of Judo-





concat two Dstreams

2014-11-11 Thread Josh J
Hi,

Is it possible to concatenate or append two Dstreams together? I have an
incoming stream that I wish to combine with data that's generated by a
utility. I then need to process the combined Dstream.

Thanks,
Josh


Re: concat two Dstreams

2014-11-11 Thread Josh J
I think it's just called union
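
A hedged sketch; the socket source is illustrative, and the utility-generated side is fed in
through queueStream, which is one way to turn locally produced RDDs into a DStream:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnionStreamsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("union-sketch"), Seconds(5))

    val incoming = ssc.socketTextStream("localhost", 9999)

    // RDDs produced by a local utility can be pushed into this queue as they are generated
    val generatedQueue = mutable.Queue[RDD[String]]()
    generatedQueue += ssc.sparkContext.parallelize(Seq("generated-1", "generated-2"))
    val generated = ssc.queueStream(generatedQueue)

    val combined = incoming.union(generated)  // process both streams together
    combined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}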

On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Is it possible to concatenate or append two Dstreams together? I have an
 incoming stream that I wish to combine with data that's generated by a
 utility. I then need to process the combined Dstream.

 Thanks,
 Josh



Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
simple

scala> val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m)

should work. Replace mmdd with the format fechau3m is in.

If you want to do it at case class level:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
//HiveContext always a good idea

import sqlContext.createSchemaRDD



case class trx_u3m(id: String, local: String, fechau3m: java.util.Date,
rubro: Int, sku: String, unidades: Double, monto: Double)



val tabla =
sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p =>
trx_u3m(p(0).trim.toString, p(1).trim.toString, new
java.text.SimpleDateFormat("mmdd").parse(p(2).trim.toString),
p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
p(6).trim.toDouble))

tabla.registerTempTable("trx_u3m")


On Tue, Nov 11, 2014 at 11:11 AM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

 How can i create a date field in spark sql? I have a S3 table and  i load
 it into a RDD.



 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 import sqlContext.createSchemaRDD



 case class trx_u3m(id: String, local: String, fechau3m: String, rubro:
 Int, sku: String, unidades: Double, monto: Double)



 val tabla = 
 sc.textFile(s3n://exalitica.com/trx_u3m/trx_u3m.txt).map(_.split(,)).map(p
 = trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString,
 p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
 p(6).trim.toDouble))

 tabla.registerTempTable(trx_u3m)



 Now my problema i show can i transform string variable into date variables
 (fechau3m)?



 *Franco Barrientos*
 Data Scientist

 Málaga #115, Of. 1003, Las Condes.
 Santiago, Chile.
 (+562)-29699649
 (+569)-76347893

 franco.barrien...@exalitica.com

 www.exalitica.com






RE: Spark and Play

2014-11-11 Thread Mohammed Guller
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x

Here is a sample build.sbt file:

name := "xyz"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  jdbc,
  anorm,
  cache,
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "com.typesafe.akka" %% "akka-actor" % "2.2.3",
  "com.typesafe.akka" %% "akka-slf4j" % "2.2.3",
  "org.apache.spark" %% "spark-sql" % "1.1.0"
)

play.Project.playScalaSettings


Mohammed

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Tuesday, November 11, 2014 2:06 PM
To: Akshat Aranya
Cc: user@spark.apache.org
Subject: Re: Spark and Play

Hi There,

Because Akka versions are not binary compatible with one another, it might not 
be possible to integrate Play with Spark 1.1.0.

- Patrick

On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote:
 Hi,

 Sorry if this has been asked before; I didn't find a satisfactory 
 answer when searching.  How can I integrate a Play application with 
 Spark?  I'm getting into issues of akka-actor versions.  Play 2.2.x 
 uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither 
 of which work fine with Spark 1.1.0.  Is there something I should do 
 with libraryDependencies in my build.sbt to make it work?

 Thanks,
 Akshat

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Mohammed Guller
David,

Here is what I would suggest:

1 - Does a new SparkContext get created in the web tier for each new request
for processing?
Create a single SparkContext that gets shared across multiple web requests. 
Depending on the framework that you are using for the web-tier, it should not 
be difficult to create a global singleton object that  holds the SparkContext.

2 - If so, how much time should we expect it to take for setting up the
context?  Our goal is to return a response to the users in under 10 seconds,
but if it takes many seconds to create a new context or otherwise set up the
job, then we need to adjust our expectations for what is possible.  From
using spark-shell one might conclude that it might take more than 10 seconds
to create a context, however it's not clear how much of that is
context-creation vs other things.

3 - (This last question perhaps deserves a post in and of itself:) if every
job is always comparing some little data structure to the same HDFS corpus
of data, what is the best pattern to use to cache the RDD's from HDFS so
they don't have to always be re-constituted from disk?  I.e. how can RDD's
be shared from the context of one job to the context of subsequent jobs?
Or does something like memcache have to be used?
Create a cached RDD in a global singleton object, which gets accessed by 
multiple web requests. You could put the cached RDD in the same object that 
holds the SparkContext, if you would like. I need to know more details about 
the specifics of your application to be more specific, but hopefully you get 
the idea.
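A minimal sketch of what such a singleton could look like (framework-agnostic; the app name and HDFS path are placeholders, not details from this thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Built lazily once per JVM; every web request reuses the same context and cached corpus.
object SparkHolder {
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setAppName("web-backend"))

  lazy val corpus: RDD[String] = sc.textFile("hdfs:///data/corpus").cache()
}

// In a request handler, something like:
//   val hits = SparkHolder.corpus.filter(_.contains(userInput)).count()

Because the corpus RDD is cached once, only the first request pays the cost of reading it from HDFS; subsequent requests hit the in-memory copy.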


Mohammed

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Tuesday, November 11, 2014 8:54 AM
To: Sonal Goyal
Cc: bethesda; u...@spark.incubator.apache.org
Subject: Re: Best practice for multi-user web controller in front of Spark

For sharing RDDs across multiple jobs - you could also have a look at Tachyon. 
It provides an HDFS compatible in-memory storage layer that keeps data in 
memory across multiple jobs/frameworks - http://tachyon-project.org/.

-

On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal 
sonalgoy...@gmail.commailto:sonalgoy...@gmail.com wrote:
I believe the Spark Job Server by Ooyala can help you share data across 
multiple jobs, take a look at 
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems 
to fit closely to what you need.

Best Regards,
Sonal
Founder, Nube Technologieshttp://www.nubetech.co




On Tue, Nov 11, 2014 at 7:20 PM, bethesda 
swearinge...@mac.commailto:swearinge...@mac.com wrote:
We are relatively new to spark and so far have been manually submitting
single jobs at a time for ML training, during our development process, using
spark-submit.  Each job accepts a small user-submitted data set and compares
it to every data set in our hdfs corpus, which only changes incrementally on
a daily basis.  (that detail is relevant to question 3 below)

Now we are ready to start building out the front-end, which will allow a
team of data scientists to submit their problems to the system via a web
front-end (web tier will be java).  Users could of course be submitting jobs
more or less simultaneously.  We want to make sure we understand how to best
structure this.

Questions:

1 - Does a new SparkContext get created in the web tier for each new request
for processing?

2 - If so, how much time should we expect it to take for setting up the
context?  Our goal is to return a response to the users in under 10 seconds,
but if it takes many seconds to create a new context or otherwise set up the
job, then we need to adjust our expectations for what is possible.  From
using spark-shell one might conclude that it might take more than 10 seconds
to create a context, however it's not clear how much of that is
context-creation vs other things.

3 - (This last question perhaps deserves a post in and of itself:) if every
job is always comparing some little data structure to the same HDFS corpus
of data, what is the best pattern to use to cache the RDD's from HDFS so
they don't have to always be re-constituted from disk?  I.e. how can RDD's
be shared from the context of one job to the context of subsequent jobs?
Or does something like memcache have to be used?

Thanks!
David



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org




Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
I just wrote a custom converter in Scala to replace HBaseResultToStringConverter.
Just a couple of lines of code.
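For anyone curious, a minimal sketch of such a converter (an assumption about what was written, not the actual code; it builds on the Converter trait used by the PySpark converters in the Spark examples and assumes the HBase 0.96+ client API):

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

// Emits "family:qualifier=value" for every cell in the row, not just the values.
class HBaseResultToColumnsConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map { cell =>
      val family    = Bytes.toString(CellUtil.cloneFamily(cell))
      val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
      val value     = Bytes.toString(CellUtil.cloneValue(cell))
      s"$family:$qualifier=$value"
    }.mkString(";")
  }
}

The class name would then be passed from PySpark as the valueConverter argument of newAPIHadoopRDD, assuming the jar containing it is on the driver and executor classpaths.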



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18639.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts.  I took a deeper look at BlockManager, RDD, and friends. 
Suppose one wanted to get native code access to un-deserialized blocks. 
This task looks very hard.  An RDD behaves much like a Scala iterator of
deserialized values, and interop with BlockManager is all on deserialized
data.  One would probably need to rewrite much of RDD, CacheManager, etc in
native code; an RDD subclass (e.g. PythonRDD) probably wouldn't work.

So exposing raw blocks to native code looks intractable.  I wonder how fast
Java/Kryo can SerDe byte arrays.  E.g. suppose we have an RDD[T] where T
is immutable and most of the memory for a single T is a byte array.  What is
the overhead of SerDe-ing T?  (Does Java/Kryo copy the underlying memory?)
If the overhead is small, then native access to raw blocks wouldn't really
yield any advantage.
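As a rough way to answer that empirically, one could time a Kryo round trip of a byte-array-backed object outside Spark. This is only a hedged micro-benchmark sketch (the Blob class and sizes are invented, and a real measurement needs JIT warm-up and many iterations):

import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

// A stand-in for "mostly a byte array" records; no-arg constructor for Kryo's FieldSerializer.
class Blob(var id: Long, var payload: Array[Byte]) {
  def this() = this(0L, Array.empty[Byte])
}

val kryo = new Kryo()
kryo.register(classOf[Blob])
val blob = new Blob(1L, Array.fill[Byte](1 << 20)(7))   // ~1 MB payload

val t0 = System.nanoTime()
val bos = new ByteArrayOutputStream()
val out = new Output(bos)
kryo.writeObject(out, blob)                             // serialization copies the payload into Kryo's buffer
out.close()
val in = new Input(bos.toByteArray)
val back = kryo.readObject(in, classOf[Blob])           // deserialization copies it back out
val t1 = System.nanoTime()
println(s"round trip: ${(t1 - t0) / 1e6} ms, payload size: ${back.payload.length}")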




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347p18640.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
I have an RDD of logs that look like this:

/no_cache/bi_event?Log=0pg_inst=517638988975678942pg=fow_mwever=c.2.1.8site=xyz.compid=156431807121222351rid=156431666543211500srch_id=156431666581865115row=6seq=1tot=1tsp=1cmp=thmb_12co_txt_url=Viewinget=clickthmb_type=pct=uc=579855lnx=SPGOOGBRANDCAMPref_url=http%3A%2F%2Fwww.abcd.com

The pairs are separated by "&", and the keys/values of each pair are
separated by "=".   Hive has a str_to_map function
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
  
that will convert this String to a map that will make the following work:

mappedString("site") will return "xyz.com"

What's the most efficient way to do this in Scala + Spark?
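One way to go from the raw line to a Map in a single pass (a hedged, untested sketch; it assumes the pairs sit in the query string after "?", separated by "&", and the input path is a placeholder):

val logs = sc.textFile("hdfs:///logs/bi_events")   // placeholder path

val parsed = logs.map { line =>
  // keep only the query-string portion, then split into key=value pairs
  val query = line.substring(line.indexOf('?') + 1)
  query.split("&").flatMap { pair =>
    pair.split("=", 2) match {
      case Array(k, v) => Some(k -> v)
      case _           => None        // skip malformed pairs
    }
  }.toMap
}

// parsed.first()("site") should then return "xyz.com"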



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Converting-Apache-log-string-into-map-using-delimiter-tp18641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
OK I got it working with:

z.map(row => (row.map(element => element.split("=")(0)) zip row.map(element
=> element.split("=")(1))).toMap)

But I'm guessing there is a more efficient way than to create two separate
lists and then zip them together and then convert the result into a map.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Converting-Apache-log-string-into-map-using-delimiter-tp18641p18643.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-11 Thread spr
I am creating a workflow;  I have an existing call to updateStateByKey that
works fine, but when I created a second use where the key is a Tuple2, it's
now failing with the dreaded overloaded method value updateStateByKey  with
alternatives ... cannot be applied to ...  Comparing the two uses I'm not
seeing anything that seems broken, though I do note that all the messages
below describe what the code provides as Time as
org.apache.spark.streaming.Time. 

a) Could the Time v org.apache.spark.streaming.Time difference be causing
this?  (I'm using Time the same in the first use, which appears to work
properly.) 

b) Any suggestions of what else could be causing the error?   

--code
val ssc = new StreamingContext(conf, Seconds(timeSliceArg))
ssc.checkpoint(".")

var lines = ssc.textFileStream(dirArg)

var linesArray = lines.map( line => (line.split("\t")))
var DnsSvr = linesArray.map( lineArray => (
     (lineArray(4), lineArray(5)),
     (1 , Time((lineArray(0).toDouble*1000).toLong) ))  )

val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
  val currentCount = if (values.isEmpty) 0 else values.map( x => x._1).sum
  val newMinTime = if (values.isEmpty) Time(Long.MaxValue) else values.map( x => x._2).min

  val (previousCount, minTime) = state.getOrElse((0, Time(System.currentTimeMillis)))

  (currentCount + previousCount, Seq(minTime, newMinTime).min)
}

var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount)   // <=== error here


--compilation output-- 
[error] /Users/spr/Documents/.../cyberSCinet.scala:513: overloaded method
value updateStateByKey with alternatives: 
[error]   (updateFunc: Iterator[((String, String), Seq[(Int,
org.apache.spark.streaming.Time)], Option[(Int,
org.apache.spark.streaming.Time)])] = Iterator[((String, String), (Int,
org.apache.spark.streaming.Time))],partitioner:
org.apache.spark.Partitioner,rememberPartitioner: Boolean)(implicit
evidence$5: scala.reflect.ClassTag[(Int,
org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String,
String), (Int, org.apache.spark.streaming.Time))] and
[error]   (updateFunc: (Seq[(Int, org.apache.spark.streaming.Time)],
Option[(Int, org.apache.spark.streaming.Time)]) = Option[(Int,
org.apache.spark.streaming.Time)],partitioner:
org.apache.spark.Partitioner)(implicit evidence$4:
scala.reflect.ClassTag[(Int,
org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String,
String), (Int, org.apache.spark.streaming.Time))] and
[error]   (updateFunc: (Seq[(Int, org.apache.spark.streaming.Time)],
Option[(Int, org.apache.spark.streaming.Time)]) = Option[(Int,
org.apache.spark.streaming.Time)],numPartitions: Int)(implicit evidence$3:
scala.reflect.ClassTag[(Int,
org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String,
String), (Int, org.apache.spark.streaming.Time))] and
[error]   (updateFunc: (Seq[(Int, org.apache.spark.streaming.Time)],
Option[(Int, org.apache.spark.streaming.Time)]) = Option[(Int,
org.apache.spark.streaming.Time)])(implicit evidence$2:
scala.reflect.ClassTag[(Int,
org.apache.spark.streaming.Time)])org.apache.spark.streaming.dstream.DStream[((String,
String), (Int, org.apache.spark.streaming.Time))] 
[error]  cannot be applied to ((Seq[(Int, org.apache.spark.streaming.Time)],
Option[(Int, org.apache.spark.streaming.Time)]) = (Int,
org.apache.spark.streaming.Time)) 
[error] var DnsSvrCum = DnsSvr.updateStateByKey[(Int,
Time)](updateDnsCount) 
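The compiler output above contains the answer: every listed overload of updateStateByKey expects an update function returning Option[(Int, Time)], while updateDnsCount returns a bare (Int, Time). A hedged sketch of the fix is simply to wrap the result in Some (returning None instead would drop that key from the state):

val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
  val currentCount = if (values.isEmpty) 0 else values.map(_._1).sum
  val newMinTime = if (values.isEmpty) Time(Long.MaxValue) else values.map(_._2).min
  val (previousCount, minTime) = state.getOrElse((0, Time(System.currentTimeMillis)))
  Some((currentCount + previousCount, Seq(minTime, newMinTime).min))   // Option, not a bare tuple
}

var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount)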



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/overloaded-method-value-updateStateByKey-cannot-be-applied-to-when-Key-is-a-Tuple2-tp18644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SVMWithSGD default threshold

2014-11-11 Thread Caron
I'm hoping to get a linear classifier on a dataset.
I'm using SVMWithSGD to train the data.
After running with the default options: val model =
SVMWithSGD.train(training, numIterations), 
I don't think SVM has done the classification correctly.

My observations:
1. the intercept is always 0.0
2. the predicted labels are ALL 1's, no 0's.

My questions are:
1. what should the numIterations be? I tried to set it to
10*trainingSetSize, is that sufficient?
2. since MLlib only accepts data with labels 0 or 1, shouldn't the
default threshold for SVMWithSGD be 0.5 instead of 0.0?
3. It seems counter-intuitive to me to have the default intercept be 0.0,
meaning the line has to go through the origin.
4. Does Spark MLlib provide an API to do grid search like scikit-learn does?

Any help would be greatly appreciated!




-
Thanks!
-Caron
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to create a Graph in GraphX?

2014-11-11 Thread ankurdave
You should be able to construct the edges in a single map() call without
using collect():

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val edges: RDD[Edge[String]] = sc.textFile(...).map { line =>
  val row = line.split(",")
  // source and destination IDs must be Longs (VertexId); the third field is the edge attribute
  Edge(row(0).toLong, row(1).toLong, row(2))
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-create-a-Graph-in-GraphX-tp18635p18646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Does spark can't work with HBase?

2014-11-11 Thread gzlj
Hello all,
   I have tested reading an HBase table with Spark 1.1 using
SparkContext.newAPIHadoopRDD, and I found the performance is much slower than
reading from Hive. I also tried reading the data using HFileScanner on one
region HFile, but the performance is not good either. So, how do I improve the
performance of Spark reading from HBase? Or does this show that Spark does not
work well with HBase?
Thanks for your help.
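For reference, the newAPIHadoopRDD pattern being described looks roughly like this (a hedged sketch; the table name is a placeholder, and most of the read throughput usually comes from tuning the scan, e.g. scanner caching, rather than from Spark itself):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hbaseRDD.count())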



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-can-t-work-with-HBase-tp18647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Imbalanced shuffle read

2014-11-11 Thread ankits
Im running a job that uses groupByKey(), so it generates a lot of shuffle
data. Then it processes this and writes files to HDFS in a forEachPartition
block. Looking at the forEachPartition stage details in the web console, all
but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with a
huge shuffle read and takes a long time to finish. 

Can someone explain why the read is all on one node and how to parallelize
this better? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Imbalanced-shuffle-read-tp18648.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ISpark class not found

2014-11-11 Thread Laird, Benjamin
I've been experimenting with the ISpark extension to IScala 
(https://github.com/tribbloid/ISpark)

Objects created in the REPL are not being loaded correctly on worker nodes, 
leading to a ClassNotFound exception. This does work correctly in spark-shell.

I was curious if anyone has used ISpark and has encountered this issue. Thanks!


Simple example:

In [1]: case class Circle(rad:Float)


In [2]: val rdd = sc.parallelize(1 to 1).map(i=Circle(i.toFloat)).take(10)

14/11/11 13:03:35 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: Unable to find class: 
[L$line5.$read$$iwC$$iwC$Circle;


Full trace in my gist: 
https://gist.github.com/benjaminlaird/3e543a9a89fb499a3a14



Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread alaa
Hey freedafeng, I'm exactly where you are. I want the output to show the
rowkey and all column qualifiers that correspond to it. How did you write
HBaseResultToStringConverter to do what you wanted it to do?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Help with processing multiple RDDs

2014-11-11 Thread Kapil Malik
Hi,

How is 78g distributed in driver, daemon, executor ?

Can you please paste the log lines saying that you don't have enough memory to
hold the data in memory?
Are you collecting any data in the driver?

Lastly, did you try doing a re-partition to create smaller and evenly 
distributed partitions?
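For example, something along these lines (a hedged sketch; the partition count is a placeholder to tune, and rdd stands for whatever is being cached):

import org.apache.spark.storage.StorageLevel

val evened = rdd
  .repartition(200)                          // placeholder count; aim for partitions well below executor memory
  .persist(StorageLevel.MEMORY_AND_DISK)     // lets oversized partitions spill to disk instead of failing to cache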

Regards,

Kapil 

-Original Message-
From: akhandeshi [mailto:ami.khande...@gmail.com] 
Sent: 12 November 2014 03:44
To: u...@spark.incubator.apache.org
Subject: Help with processing multiple RDDs

I have been struggling to process a set of RDDs.  Conceptually, it is not a
large data set. It seems that no matter how much memory I give the JVM or how I
partition, I can't process this data.  I am caching the RDD.  I have tried
persist(disk and memory), persist(memory) and persist(off_heap) with no success.
Currently I am giving 78g to my driver, daemon and executor memory.

Currently, it seems to have trouble with one of the largest partitions,
rdd_22_29, which is 25.9 GB.

The metrics page shows Summary Metrics for 29 Completed Tasks.  However, I
don't see a few of the partitions in the list below.  I do, however, seem to have
warnings in the log file, indicating that I don't have enough memory to hold
the data in memory.  I don't understand what I am doing wrong or how I can
troubleshoot it. Any pointers will be appreciated...

14/11/11 21:28:45 WARN CacheManager: Not enough space to cache partition
rdd_22_20 in memory! Free memory is 17190150496 bytes.
14/11/11 21:29:27 WARN CacheManager: Not enough space to cache partition
rdd_22_13 in memory! Free memory is 17190150496 bytes.


Block Name  Storage Level   Size in Memory  Size on DiskExecutors
rdd_22_0Memory Deserialized 1x Replicated   2.1 MB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_10   Memory Deserialized 1x Replicated   7.0 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_11   Memory Deserialized 1x Replicated   1290.2 MB   0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_12   Memory Deserialized 1x Replicated   1167.7 KB   0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_14   Memory Deserialized 1x Replicated   3.8 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_15   Memory Deserialized 1x Replicated   4.0 MB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_16   Memory Deserialized 1x Replicated   2.4 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_17   Memory Deserialized 1x Replicated   37.6 MB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_18   Memory Deserialized 1x Replicated   120.9 MB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_19   Memory Deserialized 1x Replicated   755.9 KB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_2Memory Deserialized 1x Replicated   289.5 KB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_21   Memory Deserialized 1x Replicated   11.9 KB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_22   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_23   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_24   Memory Deserialized 1x Replicated   3.0 MB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_25   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_26   Memory Deserialized 1x Replicated   4.0 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_27   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_28   Memory Deserialized 1x Replicated   1846.1 KB   0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_29   Memory Deserialized 1x Replicated   25.9 GB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_3Memory Deserialized 1x Replicated   267.1 KB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_4Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_5Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_6Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_7Memory Deserialized 1x Replicated   14.8 KB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_8Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_9Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread Tobias Pfeiffer
Hi,

On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy hmx...@gmail.com wrote:

 If I run bin/spark-shell without connecting a master, it can access a hdfs
 file on a remote cluster with kerberos authentication.

[...]

However, if I start the master and slave on the same host and using
 bin/spark-shell --master spark://*.*.*.*:7077
 run the same commands

[... ]
 org.apache.hadoop.security.AccessControlException: Client cannot
 authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
 *.*.*.*.com/98.138.236.95; destination host is: *.*.*.*:8020;


When you give no master, it is local[*], so Spark will (implicitly?)
authenticate to HDFS from your local machine using local environment
variables, key files etc., I guess.

When you give a spark://* master, Spark will run on a different machine,
where you have not yet authenticated to HDFS, I think. I don't know how to
solve this, though, maybe some Kerberos token must be passed on to the
Spark cluster?

Tobias


Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Sean Owen
Yes although I think this difference is on purpose as part of that
commercial strategy. If future versions change license it would still be
possible to not upgrade. Or fork / recreate the bean classes. Not worried
so much but it is a good point.
On Nov 11, 2014 10:06 PM, DB Tsai dbt...@dbtsai.com wrote:

 I also worry about that the author of JPMML changed the license of
 jpmml-evaluator due to his interest of his commercial business, and he
 might change the license of jpmml-model in the future.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Tue, Nov 11, 2014 at 11:43 AM, Sean Owen so...@cloudera.com wrote:
  Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not;
 they're
  3-clause BSD:
 
  https://github.com/jpmml/jpmml-model
 
  So some of the scoring components are off-limits for an AL2 project but
 the
  core model components are OK.
 
  On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com wrote:
 
  JPMML evaluator just changed their license to AGPL or commercial
  license, and I think AGPL is not compatible with apache project. Any
  advice?
 
  https://github.com/jpmml/jpmml-evaluator
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Tue, Nov 11, 2014 at 10:07 AM, Xiangrui Meng men...@gmail.com
 wrote:
   Vincenzo sent a PR and included k-means as an example. Sean is helping
   review it. PMML standard is quite large. So we may start with simple
   model export, like linear methods, then move forward to tree-based.
   -Xiangrui
  
   On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com
 wrote:
   Hello Spark and MLLib folks,
  
   So a common problem in the real world of using machine learning is
 that
   some
   data analysis use tools like R, but the more data engineers out
 there
   will
   use more advanced systems like Spark MLLib or even Python Scikit
 Learn.
  
   In the real world, I want to have a system where multiple different
   modeling environments can learn from data / build models, represent
 the
   models in a common language, and then have a layer which just takes
 the
   model and run model.predict() all day long -- scores the models in
   other
   words.
  
   It looks like the project openscoring.io and jpmml-evaluator are
 some
   amazing systems for this, but they fundamentally use PMML as the
 model
   representation here.
  
   I have read some JIRA tickets that Xiangrui Meng is interested in
   getting
   PMML implemented to export MLLib models, is that happening? Further,
   would
   something like Manish Amde's boosted ensemble tree methods be
   representable
   in PMML?
  
   Thank you!!
   Aris
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
 



Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread ramblingpolak
You need to set the spark configuration property: spark.yarn.access.namenodes
to your namenode.

e.g. spark.yarn.access.namenodes=hdfs://mynamenode:8020

Similarly, I'm curious if you're also running high availability HDFS with an
HA nameservice.

I currently have HA HDFS and kerberos and I've noticed that I must set the
above property to the currently active namenode's hostname and port. Simply
using the HA nameservice to get delegation tokens does NOT seem to work with
Spark 1.1.0 (even though I can confirm the token is acquired).

I believe this may be a bug. Unfortunately simply adding both the active and
standby name nodes does not work as this actually causes an error. This
means that when my active name node fails over, my spark configuration
becomes invalid.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread ramblingpolak
Only YARN mode is supported with kerberos. You can't use a spark:// master
with kerberos.


Tobias Pfeiffer wrote
 When you give a spark://* master, Spark will run on a different machine,
 where you have not yet authenticated to HDFS, I think. I don't know how to
 solve this, though, maybe some Kerberos token must be passed on to the
 Spark cluster?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18658.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Help with processing multiple RDDs

2014-11-11 Thread buring
I think you can try to set a lower spark.storage.memoryFraction, for example 0.4:
conf.set("spark.storage.memoryFraction", "0.4")  // default is 0.6



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628p18659.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MLLIB usage: BLAS dependency warning

2014-11-11 Thread jpl
Hi,
I am having trouble using the BLAS libs with the MLLib functions.  I am
using org.apache.spark.mllib.clustering.KMeans (on a single machine) and
running the Spark-shell with the kmeans example code (from
https://spark.apache.org/docs/latest/mllib-clustering.html)  which runs
successfully but I get the following warning in the log:

WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS

I compiled spark 1.1.0 with mvn -Phadoop-2.4  -Dhadoop.version=2.4.0
-Pnetlib-lgpl -DskipTests clean package

If anyone could please clarify the steps to get the dependencies correctly
installed and visible to spark (from
https://spark.apache.org/docs/latest/mllib-guide.html), that would be
greatly appreciated.  Using yum, I installed blas.x86_64, lapack.x86_64,
gcc-gfortran.x86_64, libgfortran.x86_64 and then downloaded Breeze and built
that successfully with Maven.  I verified that I do have
/usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 present on the machine and
ldconf -p shows these listed.  

I also tried adding /usr/lib/ to spark.executor.extraLibraryPath and I
verified it is present in the Spark webUI environment tab.   I downloaded
and compiled jblas with mvn clean install, which creates
jblas-1.2.4-SNAPSHOT.jar, and then also tried adding that to
spark.executor.extraClassPath but I still get the same WARN message. Maybe
there are a few simple steps that I am missing?  Thanks a lot.
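As a quick sanity check from the spark-shell, one hedged one-liner is to ask netlib-java which implementation it actually loaded:

// Prints com.github.fommil.netlib.NativeSystemBLAS when the system libraries were picked up,
// or com.github.fommil.netlib.F2jBLAS when it fell back to the pure-Java implementation.
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)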




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-usage-BLAS-dependency-warning-tp18660.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Help with processing multiple RDDs

2014-11-11 Thread Khandeshi, Ami
I am running as local in client mode.  I have allocated as much as 85g to the
driver, executor, and daemon.  When I look at the Java processes, I see two
(plus jps itself):
20974 SparkSubmitDriverBootstrapper
21650 Jps
21075 SparkSubmit
I have tried repartition before, but my understanding is that it comes with an
overhead, and in my previous attempt I didn't achieve much success.  I am not
clear on how best to get even partitions; any thoughts?

I am caching the RDD, and performing count on the keys.

I am running it again, with repartitioning on the dataset. Let us see if that 
helps!  I will send you the logs as soon as this completes!

Thank you,  I sincerely appreciate your help!

Regards,

Ami

-Original Message-
From: Kapil Malik [mailto:kma...@adobe.com] 
Sent: Tuesday, November 11, 2014 9:05 PM
To: akhandeshi; u...@spark.incubator.apache.org
Subject: RE: Help with processing multiple RDDs

Hi,

How is 78g distributed in driver, daemon, executor ?

Can you please paste the logs regarding  that I don't have enough memory to 
hold the data in memory
Are you collecting any data in driver ?

Lastly, did you try doing a re-partition to create smaller and evenly 
distributed partitions?

Regards,

Kapil 

-Original Message-
From: akhandeshi [mailto:ami.khande...@gmail.com] 
Sent: 12 November 2014 03:44
To: u...@spark.incubator.apache.org
Subject: Help with processing multiple RDDs

I have been struggling to process a set of RDDs.  Conceptually, it is is not a 
large data set. It seems, no matter how much I provide to JVM or partition, I 
can't seem to process this data.  I am caching the RDD.  I have tried 
persit(disk and memory), perist(memory) and persist(off_heap) with no success.  
Currently I am giving 78g to my driver, daemon and executor
memory.   

Currently, it seems to have trouble with one of the largest partition,
rdd_22_29 which is 25.9 GB.  

The metrics page shows Summary Metrics for 29 Completed Tasks.  However, I 
don't see few partitions on the list below.  However, i do seem to have 
warnings in the log file, indicating that I don't have enough memory to hold 
the data in memory.  I don't understand, what I am doing wrong or how I can 
troubleshoot. Any pointers will be appreciated...

14/11/11 21:28:45 WARN CacheManager: Not enough space to cache partition
rdd_22_20 in memory! Free memory is 17190150496 bytes.
14/11/11 21:29:27 WARN CacheManager: Not enough space to cache partition
rdd_22_13 in memory! Free memory is 17190150496 bytes.


Block Name  Storage Level   Size in Memory  Size on DiskExecutors
rdd_22_0Memory Deserialized 1x Replicated   2.1 MB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_10   Memory Deserialized 1x Replicated   7.0 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_11   Memory Deserialized 1x Replicated   1290.2 MB   0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_12   Memory Deserialized 1x Replicated   1167.7 KB   0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_14   Memory Deserialized 1x Replicated   3.8 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_15   Memory Deserialized 1x Replicated   4.0 MB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_16   Memory Deserialized 1x Replicated   2.4 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_17   Memory Deserialized 1x Replicated   37.6 MB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_18   Memory Deserialized 1x Replicated   120.9 MB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_19   Memory Deserialized 1x Replicated   755.9 KB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_2Memory Deserialized 1x Replicated   289.5 KB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_21   Memory Deserialized 1x Replicated   11.9 KB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_22   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_23   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_24   Memory Deserialized 1x Replicated   3.0 MB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_25   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_26   Memory Deserialized 1x Replicated   4.0 GB  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_27   Memory Deserialized 1x Replicated   24.0 B  0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_28   Memory Deserialized 1x Replicated   1846.1 KB   0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_29   Memory Deserialized 1x Replicated   25.9 GB 0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_3Memory Deserialized 1x Replicated   267.1 KB0.0 B
mddworker.c.fi-mdd-poc.internal:54974
rdd_22_4Memory Deserialized 1x Replicated   24.0 B  0.0 B

Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
In Spark 1.0.2, I have come across an error when I try to broadcast a rather
large numpy array (35M elements). The error information, including the
java.lang.NegativeArraySizeException and its details, is listed below.
When broadcasting a relatively smaller numpy array (30M elements), everything
works fine, and that 30M-element array takes about 230M of memory, which in my
opinion is not very large.
As far as I can tell, it seems related to py4j, but I have no idea how to fix
this. I would appreciate any hints.

py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
Trace:
java.lang.NegativeArraySizeException
at py4j.Base64.decode(Base64.java:292)
at py4j.Protocol.getBytes(Protocol.java:167)
at py4j.Protocol.getObject(Protocol.java:276)
at
py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
at py4j.commands.CallCommand.execute(CallCommand.java:77)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
-
And the test code is as follows:
import numpy as np
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
conf.set('spark.executor.memory', '4000m')
conf.set('spark.akka.timeout', '10')
conf.set('spark.ui.port', '8081')
conf.set('spark.cores.max', '150')
#conf.set('spark.rdd.compress', 'True')
conf.set('spark.default.parallelism', '300')
#configure the spark environment
sc = SparkContext(conf=conf, batchSize=1)

vec = np.random.rand(3500)
a = sc.broadcast(vec)






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Tobias Pfeiffer
Hi,

also there is Spindle https://github.com/adobe-research/spindle which was
introduced on this list some time ago. I haven't looked into it deeply, but
you might gain some valuable insights from their architecture, they are
also using Spark to fulfill requests coming from the web.

Tobias


External table partitioned by date using Spark SQL

2014-11-11 Thread ehalpern
I have large json files stored in S3 grouped under a sub-key for each year
like this:

I've defined an external table that's partitioned by year to keep the year
limited queries efficient.  The table definition looks like this:

But alas, a simple query like:

yields no results.  

If I remove the PARTITIONED BY clause and point directly at
s3://spark-data/logs/1980, everything works fine.  Am I running into a bug
or are range partitions not supported on external tables?   



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/External-table-partitioned-by-date-using-Spark-SQL-tp18663.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tobias Pfeiffer
Hi,

On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:

 But getLang() is one of the methods of twitter4j.Status since version 3.0.6
 according to the doc at:
http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--

 What version of twitter4j does Spark Streaming use?


3.0.3
https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53

Tobias


Re: concat two Dstreams

2014-11-11 Thread Sean Owen
Concatenate? no. It doesn't make sense in this context to think about one
potentially infinite stream coming after another one. Do you just want the
union of batches from two streams? yes, just union(). You can union() with
non-streaming RDDs too.
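A hedged sketch of both cases (the stream and RDD names are placeholders):

// Union of two DStreams of the same element type:
val combined = incomingStream.union(generatedStream)

// Mixing a plain, non-streaming RDD into every batch via transform():
val withStatic = incomingStream.transform(rdd => rdd.union(staticRdd))

combined.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))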

On Tue, Nov 11, 2014 at 10:41 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Is it possible to concatenate or append two Dstreams together? I have an
 incoming stream that I wish to combine with data that's generated by a
 utility. I then need to process the combined Dstream.

 Thanks,
 Josh



Re: Converting Apache log string into map using delimiter

2014-11-11 Thread Sean Owen
I think it would be faster/more compact as:

z.map(_.map { element =>
  val tokens = element.split("=")
  (tokens(0), tokens(1))
}.toMap)

(That's probably 95% right but I didn't compile or test it.)

On Wed, Nov 12, 2014 at 12:18 AM, YaoPau jonrgr...@gmail.com wrote:

 OK I got it working with:

 z.map(row = (row.map(element = element.split(=)(0)) zip row.map(element
 = element.split(=)(1))).toMap)

 But I'm guessing there is a more efficient way than to create two separate
 lists and then zip them together and then convert the result into a map.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Converting-Apache-log-string-into-map-using-delimiter-tp18641p18643.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Still struggling with building documentation

2014-11-11 Thread Sean Owen
(I don't think that's the same issue. This looks like some local problem
with tool installation?)

On Tue, Nov 11, 2014 at 9:56 PM, Patrick Wendell pwend...@gmail.com wrote:

 The doc build appears to be broken in master. We'll get it patched up
 before the release:

 https://issues.apache.org/jira/browse/SPARK-4326

 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta
 alexbare...@gmail.com wrote:
  Nichols and Patrick,
 
  Thanks for your help, but, no, it still does not work. The latest master
  produces the following scaladoc errors:
 
  [error]
 
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55:
  not found: type Type
  [error]   protected Type type() { return Type.UPLOAD_BLOCK; }
  [error] ^
  [error]
 
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39:
  not found: type Type
  [error]   protected Type type() { return Type.STREAM_HANDLE; }
  [error] ^
  [error]
 
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40:
  not found: type Type
  [error]   protected Type type() { return Type.OPEN_BLOCKS; }
  [error] ^
  [error]
 
 /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44:
  not found: type Type
  [error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
  [error] ^
 
  ...
 
  [error] four errors found
  [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
  [error] (spark/scalaunidoc:doc) Scaladoc generation failed
  [error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM
  Moving back into docs dir.
  Making directory api/scala
  cp -r ../target/scala-2.10/unidoc/. api/scala
  Making directory api/java
  cp -r ../target/javaunidoc/. api/java
  Moving to python/docs directory and building sphinx.
  Makefile:14: *** The 'sphinx-build' command was not found. Make sure you
  have Sphinx installed, then set the SPHINXBUILD environment variable to
  point to the full path of the 'sphinx-build' executable. Alternatively
 you
  can add the directory with the executable to your PATH. If you don't have
  Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.
 
  Moving back into home dir.
  Making directory api/python
  cp -r python/docs/_build/html/. docs/api/python
  /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or
 directory -
  python/docs/_build/html/. (Errno::ENOENT)
  from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
  from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
  from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
  from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
  from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top
  (required)'
  from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
  from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
  from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
  from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
  from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
  from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
  from /usr/bin/jekyll:224:in `new'
  from /usr/bin/jekyll:224:in `main'
 
  What next?
 
  Alex
 
 
 
 
  On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  I believe the web docs need to be built separately according to the
  instructions here.
 
  Did you give those a shot?
 
  It's annoying to have a separate thing with new dependencies in order to
  build the web docs, but that's how it is at the moment.
 
  Nick
 
  On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta 
 alexbare...@gmail.com
  wrote:
 
  I finally came to realize that there is a special maven target to build
  the scaladocs, although arguably a very unintuitive one: mvn verify. So
 now I
  have scaladocs for each package, but not for the whole spark project.
  Specifically, build/docs/api/scala/index.html is missing. Indeed the
 whole
  build/docs/api directory referenced in api.html is missing. How do I
 build
  it?
 
  Alex Baretta
 
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SVMWithSGD default threshold

2014-11-11 Thread Sean Owen
I think you need to use setIntercept(true) to get it to allow a non-zero
intercept. I also kind of agree that's not obvious or the intuitive default.

Is your data set highly imbalanced, with lots of positive examples? that
could explain why predictions are heavily skewed.

Iterations should definitely not be of the same order of magnitude as your
input, which could have millions of elements. 100 should be plenty as a
default.

Threshold is not related to the 0/1 labels in SVMs. It is a threshold on
the SVM margin. Margin is 0 at the decision boundary, not 0.5.

There's no grid search at this stage but it's easy to code up in a short
method.
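A hedged sketch of such a method (the regParam grid, iteration count, and AUC-based selection are arbitrary illustrative choices, not an MLlib API):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def gridSearch(train: RDD[LabeledPoint], valid: RDD[LabeledPoint]): (Double, Double) = {
  val grid = Seq(0.001, 0.01, 0.1, 1.0)
  val scored = grid.map { reg =>
    val svm = new SVMWithSGD()
    svm.setIntercept(true)                              // allow a non-zero intercept
    svm.optimizer.setNumIterations(100).setRegParam(reg)
    val model = svm.run(train).clearThreshold()         // raw margins instead of 0/1 labels
    val scoreAndLabels = valid.map(p => (model.predict(p.features), p.label))
    val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
    (reg, auc)
  }
  scored.maxBy(_._2)                                    // (best regParam, its AUC)
}

It trains one model per grid point against a single train/validation split, which is usually enough to see whether regularization or the intercept is the problem.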


On Wed, Nov 12, 2014 at 12:41 AM, Caron caron.big...@gmail.com wrote:

 I'm hoping to get a linear classifier on a dataset.
 I'm using SVMWithSGD to train the data.
 After running with the default options: val model =
 SVMWithSGD.train(training, numIterations),
 I don't think SVM has done the classification correctly.

 My observations:
 1. the intercept is always 0.0
 2. the predicted labels are ALL 1's, no 0's.

 My questions are:
 1. what should the numIterations be? I tried to set it to
 10*trainingSetSize, is that sufficient?
 2. since MLlib only accepts data with labels 0 or 1, shouldn't the
 default threshold for SVMWithSGD be 0.5 instead of 0.0?
 3. It seems counter-intuitive to me to have the default intercept be 0.0,
 meaning the line has to go through the origin.
 4. Does Spark MLlib provide an API to do grid search like scikit-learn
 does?

 Any help would be greatly appreciated!




 -
 Thanks!
 -Caron
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




How to solve this core dump error

2014-11-11 Thread shiwentao
I got a core dump when I used Spark 1.1.0. The environment is shown below.

software environment:
  OS:  CentOS 6.3
  jvm : Java HotSpot(TM) 64-Bit Server VM(build 21.0-b17,mixed mode) 
 
hardware environment:
 memory: 64G 


I run three Spark processes with JVM -Xmx args like this:
-Xmx 28G 
-Xmx 2G
-Xmx 14G 

The process that core dumped is the one with the -Xmx 14G parameter. I found
the core dump happened when the IO was very high.

The gdb backtrace (bt) from the core dump is as follows:

(gdb) bt
#0  0x003fc04328a5 in raise () from /lib64/libc.so.6
#1  0x003fc0434085 in abort () from /lib64/libc.so.6
#2  0x003fc046fa37 in __libc_message () from /lib64/libc.so.6
#3  0x003fc0475366 in malloc_printerr () from /lib64/libc.so.6
#4  0x2ad53773dad0 in Java_java_io_UnixFileSystem_getLength () from
/apps/svr/jdk-7/jre/lib/amd64/libjava.so
#5  0x2ad53c5c9b5c in ?? ()
#6  0x00049d9908e0 in ?? ()
#7  0x0004413d7030 in ?? ()
#8  0x0004498f77e0 in ?? ()
#9  0x2ad53c3dff18 in ?? ()
#10 0x0004498f77e0 in ?? ()
#11 0x2ad53c7c1668 in ?? ()
#12 0x0004413d7690 in ?? ()
#13 0x0004413d7678 in ?? ()
#14 0x in ?? ()
 

Additionally, I set the JVM parameters of the Spark executor like this:

-XX:SurvivorRatio=8 -XX:MaxPermSize=1g -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/apps/logs/spark/ -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:SoftRefLRUPolicyMSPerMB=0
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSConcurrentMTEnabled
-XX:+DisableExplicitGC -Djava.net.preferIPv4Stack=true


I would sincerely appreciate your help. Thanks.





  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-this-core-dump-error-tp18672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
This PR fix the problem: https://github.com/apache/spark/pull/2659

cc @josh

Davies

On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote:
 In spark-1.0.2, I have come across an error when I try to broadcast a quite
 large numpy array(with 35M dimension). The error information except the
 java.lang.NegativeArraySizeException error and details is listed below.
 Moreover, when broadcast a relatively smaller numpy array(30M dimension),
 everything works fine. And 30M dimension numpy array takes 230M memory
 which, in my opinion, not very large.
 As far as I have surveyed, it seems related with py4j. However, I have no
 idea how to fix  this. I would be appreciated if I can get some hint.
 
 py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
 Trace:
 java.lang.NegativeArraySizeException
 at py4j.Base64.decode(Base64.java:292)
 at py4j.Protocol.getBytes(Protocol.java:167)
 at py4j.Protocol.getObject(Protocol.java:276)
 at
 py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
 at py4j.commands.CallCommand.execute(CallCommand.java:77)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 -
 And the test code is a follows:
 conf =
 SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
 conf.set('spark.executor.memory', '4000m')
 conf.set('spark.akka.timeout', '10')
 conf.set('spark.ui.port','8081')
 conf.set('spark.cores.max','150')
 #conf.set('spark.rdd.compress', 'True')
 conf.set('spark.default.parallelism', '300')
 #configure the spark environment
 sc = SparkContext(conf=conf, batchSize=1)

 vec = np.random.rand(3500)
 a = sc.broadcast(vec)






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu:

Thank you very much for your help. I will apply that patch. By the way, since
I have succeeded in broadcasting an array of size 30M, and the log said that
such an array takes around 230MB of memory, I think the numpy array that
leads to the error is much smaller than 2G.

On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List]
ml-node+s1001560n18673...@n3.nabble.com wrote:

 This PR fix the problem: https://github.com/apache/spark/pull/2659

 cc @josh

 Davies

 On Tue, Nov 11, 2014 at 7:47 PM, bliuab wrote:

  In spark-1.0.2, I have come across an error when I try to broadcast a
 quite
  large numpy array(with 35M dimension). The error information except the
  java.lang.NegativeArraySizeException error and details is listed below.
  Moreover, when broadcast a relatively smaller numpy array(30M
 dimension),
  everything works fine. And 30M dimension numpy array takes 230M memory
  which, in my opinion, not very large.
  As far as I have surveyed, it seems related with py4j. However, I have
 no
  idea how to fix  this. I would be appreciated if I can get some hint.
  
  py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
  Trace:
  java.lang.NegativeArraySizeException
  at py4j.Base64.decode(Base64.java:292)
  at py4j.Protocol.getBytes(Protocol.java:167)
  at py4j.Protocol.getObject(Protocol.java:276)
  at
  py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
  at py4j.commands.CallCommand.execute(CallCommand.java:77)
  at py4j.GatewayConnection.run(GatewayConnection.java:207)
  -
  And the test code is a follows:
  conf =
  SparkConf().setAppName('brodyliu_LR').setMaster('spark://
 10.231.131.87:5051')
  conf.set('spark.executor.memory', '4000m')
  conf.set('spark.akka.timeout', '10')
  conf.set('spark.ui.port','8081')
  conf.set('spark.cores.max','150')
  #conf.set('spark.rdd.compress', 'True')
  conf.set('spark.default.parallelism', '300')
  #configure the spark environment
  sc = SparkContext(conf=conf, batchSize=1)
 
  vec = np.random.rand(3500)
  a = sc.broadcast(vec)
 
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 







-- 
My Homepage: www.cse.ust.hk/~bliuab
MPhil student in Hong Kong University of Science and Technology.
Clear Water Bay, Kowloon, Hong Kong.
Profile at LinkedIn http://www.linkedin.com/pub/liu-bo/55/52b/10b.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662p18674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
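
For anyone stuck on a release without that patch, one possible workaround (a sketch under the assumption that the driver can hold the full array; it is not the fix in the PR above) is to split the array, broadcast the pieces so that no single serialized value is huge, and reassemble them on the executors:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="chunked_broadcast_sketch")

    vec = np.random.rand(35000000)   # illustrative size, matching the thread

    # Broadcast the vector in pieces; each piece is pickled and shipped separately.
    chunks = [sc.broadcast(c) for c in np.array_split(vec, 8)]

    def dot_with_vec(partition):
        # Rebuild the full vector once per partition on the executor side.
        full = np.concatenate([b.value for b in chunks])
        for row in partition:
            yield float(np.dot(full[:len(row)], row))

    print(sc.parallelize([[0.1, 0.2, 0.3]], 2).mapPartitions(dot_with_vec).collect())

The trade-off is the per-partition concatenation cost; on a patched build, a single sc.broadcast(vec) is simpler.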

Read a HDFS file from Spark source code

2014-11-11 Thread rapelly kartheek
Hi,

I am trying to access a file in HDFS from within the Spark source code. Basically, I am tweaking the Spark source code and need to read a file in HDFS from inside it, but I really don't understand how to go about doing this.

Can someone please help me out in this regard?
Thank you!
Karthik


Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread hmxxyy
Thanks guys for the info.

I have to use YARN to access a Kerberos-secured cluster.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread Nick Pentreath
Feel free to add that converter as an option in the Spark examples via a PR :)

—
Sent from Mailbox

On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote:

 Hey freedafeng, I'm exactly where you are. I want the output to show the
 rowkey and all column qualifiers that correspond to it. How did you write
 HBaseResultToStringConverter to do what you wanted it to do?
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18650.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
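
For reference, a minimal sketch of the PySpark side of such a setup. The value converter class name below is hypothetical: it stands in for a custom Scala converter, compiled and shipped with --jars or --driver-class-path, that serializes each row's cells as family:qualifier=value pairs (the HBaseResultToStringConverter bundled with the examples returns only a single cell value, which is why the thread needs a custom one):

    from pyspark import SparkContext

    sc = SparkContext(appName="hbase_rowkey_qualifiers_sketch")

    # Placeholder cluster settings.
    conf = {"hbase.zookeeper.quorum": "zk-host.example.com",
            "hbase.mapreduce.inputtable": "my_table"}

    # Hypothetical custom converter; see the note above.
    value_converter = "com.example.pythonconverters.RowKeyAndQualifiersConverter"

    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter=value_converter,
        conf=conf)

    # Each record comes back as (row key string, converted Result string).
    for rowkey, cells in hbase_rdd.take(5):
        print(rowkey, cells)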

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
FWIW, if you do decide to handle language detection on your own machines, this
library works great on tweets: https://github.com/carrotsearch/langid-java

On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:

 But getLang() is one of the methods of twitter4j.Status since version
 3.0.6
 according to the doc at:
http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--

 What version of twitter4j does Spark Streaming use?


 3.0.3
 https://github.com/apache/spark/blob/master/external/twitter/pom.xml#L53

 Tobias
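
Since the bundled twitter4j 3.0.3 predates Status.getLang(), another option is to run language identification on the tweet text yourself. A minimal sketch using the langid Python package (the Python counterpart of the langid-java library linked above), under the assumption that the tweet texts are already available as an RDD of strings and that langid is installed on the workers:

    import langid                      # pip install langid
    from pyspark import SparkContext

    sc = SparkContext(appName="english_tweet_filter_sketch")

    # Hypothetical sample texts; in practice this RDD would hold raw tweet text.
    tweets = sc.parallelize([
        "the quick brown fox jumps over the lazy dog",
        "le renard brun rapide saute par-dessus le chien paresseux",
    ])

    def is_english(text):
        lang, _score = langid.classify(text)   # returns (language code, score)
        return lang == "en"

    print(tweets.filter(is_english).collect())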




Re: Read a HDFS file from Spark source code

2014-11-11 Thread Samarth Mailinglist
Instead of a local file path, use an HDFS URI. For example (in Python):

data = sc.textFile("hdfs://localhost/user/someuser/data")

On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,

  I am trying to access a file in HDFS from within the Spark source code.
  Basically, I am tweaking the Spark source code and need to read a file in
  HDFS from inside it, but I really don't understand how to go about doing this.

  Can someone please help me out in this regard?
  Thank you!
 Karthik
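
A slightly fuller sketch of the same idea, with placeholder namenode host and port. If HADOOP_CONF_DIR is visible to Spark and fs.defaultFS points at the cluster, a bare absolute path also resolves against HDFS:

    from pyspark import SparkContext

    sc = SparkContext(appName="hdfs_read_sketch")

    # Fully qualified URI: hdfs://<namenode-host>:<port>/<path> (host and port are placeholders).
    lines = sc.textFile("hdfs://namenode.example.com:8020/user/someuser/data")
    print(lines.count())

    # With fs.defaultFS configured, the scheme and authority can be omitted.
    print(sc.textFile("/user/someuser/data").first())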


