Spark on Hadoop with Java 8

2014-08-27 Thread jatinpreet
Hi,

I am contemplating the use of Hadoop with Java 8 in a production system. I
will be using Apache Spark for doing most of the computations on data stored
in HBase.

Although Hadoop seems to support JDK 8 with some tweaks, the official HBase
site states the following for version 0.98,

Running with JDK 8 works but is not well tested. Building with JDK 8 would
require removal of the deprecated remove() method of the PoolMap class and
is under consideration. See HBASE-7608 for more information about JDK 8
support.

I am inclined towards using JDK 8 specifically for its support of lambda
expressions, which will take a lot of verbosity out of my Spark
programs (the Scala learning curve is a deterrent for me, as a possible bottleneck
for future talent acquisition).

Is it a good idea to use Spark/Hadoop/HBase combo with Java 8 at the moment?

Thanks,
Jatin







Re: Upgrading 1.0.0 to 1.0.2

2014-08-27 Thread Victor Tso-Guillen
Ah, thanks.


On Tue, Aug 26, 2014 at 7:32 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, Victor,

 the issue with having different versions in the driver and the cluster is that
 the master will shut down your application due to an inconsistent
 serialVersionUID in ExecutorState

 Best,

 --
 Nan Zhu

 On Tuesday, August 26, 2014 at 10:10 PM, Matei Zaharia wrote:

 Things will definitely compile, and apps compiled on 1.0.0 should even be
 able to link against 1.0.2 without recompiling. The only problem is if you
 run your driver with 1.0.0 on its classpath, but the cluster has 1.0.2 in
 executors.

 For Mesos and YARN vs standalone, the difference is that they just have
 more features, at the expense of more complicated setup. For example, they
 have richer support for cross-application sharing (see
 https://spark.apache.org/docs/latest/job-scheduling.html), and the
 ability to run non-Spark applications on the same cluster.

 Matei

 On August 26, 2014 at 6:53:33 PM, Victor Tso-Guillen (v...@paxata.com)
 wrote:

 Yes, we are standalone right now. Do you have literature why one would
 want to consider Mesos or YARN for Spark deployments?

 Sounds like I should try upgrading my project and seeing if everything
 compiles without modification. Then I can connect to an existing 1.0.0
 cluster and see what happens...

 Thanks, Matei :)


 On Tue, Aug 26, 2014 at 6:37 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Is this a standalone mode cluster? We don't currently make this
 guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem though
 is that the standalone mode grabs the executors' version of Spark code from
 what's installed on the cluster, while your driver might be built against
 another version. On YARN and Mesos, you can more easily mix different
 versions of Spark, since each application ships its own Spark JAR (or
 references one from a URL), and this is used for both the driver and
 executors.

  Matei

 On August 26, 2014 at 6:10:57 PM, Victor Tso-Guillen (v...@paxata.com)
 wrote:

  I wanted to make sure that there's full compatibility between minor
 releases. I have a project that has a dependency on spark-core so that it
 can be a driver program and that I can test locally. However, when
 connecting to a cluster you don't necessarily know what version you're
 connecting to. Is a 1.0.0 cluster binary compatible with a 1.0.2 driver
 program? Is a 1.0.0 driver program binary compatible with a 1.0.2 cluster?






RE: What is a Block Manager?

2014-08-27 Thread Liu, Raymond
The framework has this info to manage cluster status, and this info (e.g. 
worker number) is also available through the Spark metrics system.
From the user application's point of view, though, can you give an example of why 
you need this info and what you would plan to do with it?

Best Regards,
Raymond Liu

From: Victor Tso-Guillen [mailto:v...@paxata.com] 
Sent: Wednesday, August 27, 2014 1:40 PM
To: Liu, Raymond
Cc: user@spark.apache.org
Subject: Re: What is a Block Manager?

We're a single-app deployment so we want to launch as many executors as the 
system has workers. We accomplish this by not configuring the max for the 
application. However, is there really no way to inspect what machines/executor 
ids/number of workers/etc is available in context? I'd imagine that there'd be 
something in the SparkContext or in the listener, but all I see in the listener 
is block managers getting added and removed. Wouldn't one care about the 
workers getting added and removed at least as much as for block managers?
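
For what it's worth, those block-manager listener events can themselves be used to track 
executors as they come and go. A minimal, untested Scala sketch (the event and field names 
are those of the 1.x SparkListener developer API and should be treated as an assumption; 
sc is an existing SparkContext):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded, SparkListenerBlockManagerRemoved}

// Logs one line per block manager (i.e. per executor) as it registers or drops out
class ExecutorTracker extends SparkListener {
  override def onBlockManagerAdded(added: SparkListenerBlockManagerAdded) {
    val id = added.blockManagerId
    println("Block manager added: executor " + id.executorId + " at " + id.host + ":" + id.port)
  }
  override def onBlockManagerRemoved(removed: SparkListenerBlockManagerRemoved) {
    println("Block manager removed: " + removed.blockManagerId)
  }
}

// Register it on the driver:
// sc.addSparkListener(new ExecutorTracker)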

On Tue, Aug 26, 2014 at 6:58 PM, Liu, Raymond raymond@intel.com wrote:
Basically, a Block Manager manages the storage for most of the data in Spark, 
to name a few: blocks that represent cached RDD partitions, intermediate shuffle 
data, broadcast data, etc. It is per executor, and in standalone mode you 
normally have one executor per worker.

You don't control how many workers you have at runtime, but you can manage 
how many executors your application will launch. Check each running 
mode's documentation for details. (But control where? Hardly; YARN mode does 
some work based on data locality, but this is done by the framework, not the user 
program.)

Best Regards,
Raymond Liu

From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Tuesday, August 26, 2014 11:42 PM
To: user@spark.apache.org
Subject: What is a Block Manager?

I'm curious not only about what they do, but what their relationship is to the 
rest of the system. I find that I get listener events for n block managers 
added where n is also the number of workers I have available to the 
application. Is this a stable constant?

Also, are there ways to determine at runtime how many workers I have and where 
they are?

Thanks,
Victor



Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Antonio Jesus Navarro
Maybe this would interest you:

CPU and GPU-accelerated Machine Learning Library:

https://github.com/BIDData/BIDMach


2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com:

 You should try to find a Java-based library, then you can call it from
 Scala.

 Matei

 On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote:

 Hi, I am trying to find a CUDA library in Scala, to see if some matrix
 manipulation in MLlib can be sped up.

 I googled around but found no active projects on Scala+CUDA. Python is
 supported by CUDA, though. Any suggestions on whether this idea makes
 sense?

 Best regards,
 Wei




Is there a way to insert data into existing parquet file using spark ?

2014-08-27 Thread rafeeq s
Hi,

*Is there a way to insert data into an existing Parquet file using Spark?*

I am using Spark Streaming and Spark SQL to store real-time data into
Parquet files and then query them using Impala.

Spark creates multiple subdirectories of Parquet files, which makes it a
challenge for me to load them into Impala. I want to insert data into an existing
Parquet file instead of creating a new Parquet file each time.

I have tried the INSERT statement, but it makes performance too slow.

Please suggest whether there is any way to insert or append data into an existing
Parquet file.



Regards,

Rafeeq S
*(“What you do is what matters, not what you think or say or plan.” )*


Re: Spark Streaming Output to DB

2014-08-27 Thread Akhil Das
Like Mayur said, it's better to use mapPartitions instead of map.

Here's a piece of code which reads a text file and inserts each
row into the database. I haven't tested it; it might throw up some
serialization errors, in which case you've got to make the offending objects serializable!

// Requires java.sql.Connection, Driver, DriverManager, Statement and the Spark Java API classes
JavaRDD<String> txtRDD = jsc.textFile("/sigmoid/logs.cap");

JavaRDD<String> dummyRDD = txtRDD.map(new Function<String, String>() {
    @Override
    public String call(String raw) throws Exception {

        // Creating the JDBC connection
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Driver myDriver = new oracle.jdbc.driver.OracleDriver();
        DriverManager.registerDriver(myDriver);
        String URL = "jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP";
        Connection conn = DriverManager.getConnection(URL);
        Statement stmt = conn.createStatement();

        // Do some operations!! (executeUpdate, since this is an INSERT)
        stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')");

        // Now close the connection
        conn.close();
        return raw + " [ Added ]";
    }
});
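
For comparison, here is an untested Scala sketch of the mapPartitions idea Mayur suggests, 
opening one connection per partition instead of one per record (the path, driver, URL and 
table are the made-up ones from the snippet above, and sc is assumed to be an existing 
SparkContext):

import java.sql.DriverManager

val txtRDD = sc.textFile("/sigmoid/logs.cap")

val inserted = txtRDD.mapPartitions { iter =>
  // One JDBC connection for the whole partition
  Class.forName("oracle.jdbc.driver.OracleDriver")
  val conn = DriverManager.getConnection("jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP")
  val stmt = conn.createStatement()
  // Force the writes before closing the connection (iterators are lazy)
  val out = iter.map { raw =>
    stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')")
    raw + " [ Added ]"
  }.toList
  stmt.close()
  conn.close()
  out.iterator
}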


Thanks
Best Regards


On Wed, Aug 27, 2014 at 10:06 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 I would suggest you use the JDBC connector in mapPartitions instead of map,
 as JDBC connections are costly and can really impact your performance.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Tue, Aug 26, 2014 at 6:45 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Yes, you can open a JDBC connection at the beginning of the map method,
 close this connection at the end of map(), and in between you can use
 this connection.

 Thanks
 Best Regards


 On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma raviprincesha...@gmail.com
 wrote:

 Hello People,

  I'm using Java Spark Streaming. I'm just wondering, can I make a simple
 JDBC connection in the JavaDStream map() method?

 Or

 Do I need to create a JDBC connection for each JavaPairDStream, after
 the map task?

 Kindly give your thoughts.


 Cheers,
 Ravi Sharma







Developing a spark streaming application

2014-08-27 Thread Filip Andrei
Hey guys, the problem I'm trying to tackle is the following:

- I need a data source that emits messages at a certain frequency
- There are N neural nets that need to process each message individually
- The outputs from all neural nets are aggregated and only when all N
outputs for each message are collected, should a message be declared fully
processed
- At the end i should measure the time it took for a message to be fully
processed (time between when it was emitted and when all N neural net
outputs from that message have been collected)


What I'm mostly interested in is whether I approached the problem correctly in
the first place and, if so, some best-practice pointers on my approach.






And my current implementation is the following:


For a data source I created the class
public class JavaRandomReceiver extends Receiver<Map<String, Object>>

as I decided a key-value store would be best suited to holding the emitted data.


The onStart() method initializes a custom random sequence generator and
starts a thread that continuously generates new neural net inputs and stores
them as follows:

SensorData sdata = generator.createSensorData();

Map<String, Object> result = new HashMap<String, Object>();

result.put("msgNo", sdata.getMsgNo());
result.put("sensorTime", sdata.getSampleTime());
result.put("list", sdata.getPayload());
result.put("timeOfProc", sdata.getCreationTime());

store(result);

// sleeps for a given amount of time set at generator creation
generator.waitForNextTuple();

The msgNo here is incremented for each newly created message and is used to
keep track of which neural net outputs belong to which message.


The neural net functionality is added by creating a custom mapper
public class NeuralNetMapper implements Function<Map<String, Object>,
Map<String, Object>>

whose call function basically just takes the input map, plugs its list
object as the input to the neural net object, replaces the map's initial
list with the neural net output and returns the modified map.




The aggregator is implemented as a single class that has the following form

public class JavaSyncBarrier implements
Function<JavaRDD<Map<String, Object>>, Void>



This class maintains a Google Guava cache of neural net outputs that it has
received, in the form of
<Long, List<Map<String, Object>>>, where the Long value is the msgNo
and the list contains all maps carrying said message number.

When a new map is received, it is added to the cache and its list's length is
compared to the total number of neural nets; if these numbers match,
that message number is considered fully processed, and the difference between
timeOfProc (all maps with the same msgNo have the same timeOfProc) and the
current system time is displayed as the total time needed for processing.





Now the way all these components are linked together is the following:

public static void main(String[] args) {

    SparkConf conf = new SparkConf();
    conf.setAppName("SimpleSparkStreamingTest");

    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

    jssc.checkpoint("/tmp/spark-tempdir");

    // Generator config goes here
    // Set to emit new message every 1 second
    // ---

    // Neural net config goes here
    // ---

    JavaReceiverInputDStream<Map<String, Object>> rndLists = jssc
            .receiverStream(new JavaRandomReceiver(generatorConfig));

    List<JavaDStream<Map<String, Object>>> neuralNetOutputStreams =
            new ArrayList<JavaDStream<Map<String, Object>>>();

    for (int i = 0; i < numberOfNets; i++) {
        neuralNetOutputStreams.add(
                rndLists.map(new NeuralNetMapper(neuralNetConfig)));
    }

    JavaDStream<Map<String, Object>> joined = joinStreams(neuralNetOutputStreams);

    joined.foreach(new JavaSyncBarrier(numberOfNets));

    jssc.start();
    jssc.awaitTermination();
}

where joinStreams unifies a list of streams:
public static <T> JavaDStream<T> joinStreams(List<JavaDStream<T>> streams) {

    JavaDStream<T> result = streams.get(0);
    for (int i = 1; i < streams.size(); i++) {
        result = result.union(streams.get(i));
    }

    return result;
}







hive on spark yarn

2014-08-27 Thread centerqi hu
Hi all,


When I run a simple SQL statement, I encounter the following error.

Hive 0.12 (metastore in MySQL)

Hadoop 2.4.1

Spark 1.0.2 built with Hive


My HQL code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.hive.LocalHiveContext

object HqlTest {
  case class Record(key: Int, value: String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HiveFromSpark")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
  }
}






14/08/27 16:07:08 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
14/08/27 16:07:08 INFO ObjectStore: ObjectStore, initialize called
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table src
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:189)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:163)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:250)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104)
at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75)
at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78)
at HqlTest$.main(HqlTest.scala:15)
at HqlTest.main(HqlTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
Caused by: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1210)
... 31 more
Caused by: javax.jdo.JDOFatalUserException: Class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:234)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:209)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at 

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-27 Thread BertrandR
Thank you for your answers, and sorry for my lack of understanding.
So I tried what you suggested, with/without unpersisting and with .cache()
(also persist(StorageLevel.MEMORY_AND_DISK), but this is not allowed for msg
because you apparently can't change the storage level) for msg, g and
newVerts, but it gave the exact same result mentioned in my previous post.

For example, before failing, in one iteration I get the following log for the
innerJoin task:

14/08/27 10:30:58 INFO executor.Executor: Running task ID 37
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_3 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_6 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_7 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_4 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_10
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_11
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_8 locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_14
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_15
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_12
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_18
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_19
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block broadcast_16
locally
14/08/27 10:30:58 INFO storage.BlockManager: Found block rdd_122_0 locally
14/08/27 10:30:58 INFO
storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight:
50331648, targetRequestSize: 10066329
14/08/27 10:30:58 INFO
storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty
blocks out of 1 blocks
14/08/27 10:30:58 INFO
storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote
fetches in 0 ms

If I understand this correctly, this means that it is looking for the whole
list of broadcasted variables, even though it should only need the 3 current
values... I'm a little confused about this.

For the sake of completeness, here is the same portion of code I am using,
with cache() and unpersist removed (I tried multiple combinations; I also
tried without broadcasting variables, e.g. directly feeding the innerJoin
with the value instead of the broadcast):

  def run(graph: Graph[Long, Long], m: Long)(implicit sc: SparkContext) = {
    val spd = SparsePD
    // Init fusionMap
    var fusionMap = Map[Long, Long]().withDefault(x => x)
    // Init tots
    val tots = Map[Long, Double]().withDefaultValue(1.0)
    var totBcst = sc.broadcast(tots)
    var fusionBcst = sc.broadcast(fusionMap)
    val mC = sc.broadcast(m)
    // Initial distributions
    var g = graph.mapVertices({ case (vid, deg) =>
      VertexProp(somedistrib.withDefaultValue(0.0), deg) })
    var newVerts = g.vertices
    // Initial messages
    var msg = g.mapReduceTriplets(MFExecutor.sendMsgMF, MFExecutor.mergeMsgMF)
    var iter = 0

    while (iter < 20) {
      // Messages
      val oldMessages = msg
      val oldVerts = newVerts
      newVerts = newVerts.innerJoin(msg)(MFExecutor.vprogMF(mC, totBcst, fusionBcst))
        .persist(StorageLevel.MEMORY_AND_DISK)
      newVerts.checkpoint() // Kills the lineage of newVerts?
      newVerts.count()      // Materializes the two previous lines
      val prevG = g
      g = graph.outerJoinVertices(newVerts)({ case (vid, deg, newOpt) =>
        newOpt.getOrElse(VertexProp(Map(vid -> 1.0).withDefaultValue(0.0), deg)) }).cache()
      //g = g.outerJoinVertices(newVerts)({ case (vid, old, newOpt) => newOpt.getOrElse(old) })

      // 1st global var
      val fusionAcc = sc.accumulable[Map[Long, Long], (Long, Long)](fusionMap)(FusionAccumulable)
      g.triplets.filter(tp => testEq(fusionBcst)(tp.srcId, tp.dstId) &&
        (spd.dotPD(tp.dstAttr.prob, tp.srcAttr.prob) > 0.9)).foreach(tp =>
        fusionAcc += (tp.dstId, tp.srcId))
      //fusionBcst.unpersist(blocking = false)
      fusionMap = fusionAcc.value
      fusionBcst = sc.broadcast(fusionMap)

      // 2nd global var
      val totAcc = sc.accumulator[Map[Long, Double]](Map[Long, Double]().withDefaultValue(0.0))(TotAccumulable)
      newVerts.foreach({ case (vid, vprop) => totAcc +=
        vprop.prob.mapValues(p => p * vprop.deg).withDefaultValue(0.0) })
      //totBcst.unpersist(blocking = false)
      totBcst = sc.broadcast(totAcc.value)

      // New messages
      msg = g.mapReduceTriplets(MFExecutor.sendMsgMF, MFExecutor.mergeMsgMF).cache()

      // Unpersist options
      //oldMessages.unpersist(blocking = false)
      //oldVerts.unpersist(blocking = false)
      //prevG.unpersistVertices(blocking = false)

      iter = iter + 1
    }

Example File not running

2014-08-27 Thread Hingorani, Vineet
Hello all,



I am able to use Spark in the shell, but I am not able to run a Spark file. I am 
using sbt and the jar is created, but even the SimpleApp class example given on 
the site http://spark.apache.org/docs/latest/quick-start.html is not running. I 
installed a prebuilt version of Spark, and sbt package compiles the Scala 
file to a jar. I am running it locally on my machine. The error log is huge, but 
it starts with something like this:



14/08/27 12:14:21 INFO SecurityManager: Using Spark's default log4j profile: 
org/apache/sp

ark/log4j-defaults.properties

14/08/27 12:14:21 INFO SecurityManager: Changing view acls to: D062844

14/08/27 12:14:21 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls

disabled; users with view permissions: Set(D062844)

14/08/27 12:14:22 INFO Slf4jLogger: Slf4jLogger started

14/08/27 12:14:22 INFO Remoting: Starting remoting

14/08/27 12:14:22 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spa

rk@10.94.74.159:51157]

14/08/27 12:14:22 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@10.9

4.74.159:51157]

14/08/27 12:14:22 INFO SparkEnv: Registering MapOutputTracker

14/08/27 12:14:22 INFO SparkEnv: Registering BlockManagerMaster

14/08/27 12:14:22 INFO DiskBlockManager: Created local directory at 
C:\Users\D062844\AppDa

ta\Local\Temp\spark-local-20140827121422-dec8

14/08/27 12:14:22 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.

14/08/27 12:14:22 INFO ConnectionManager: Bound socket to port 51160 with id = 
ConnectionM

anagerId(10.94.74.159,51160)

14/08/27 12:14:22 INFO BlockManagerMaster: Trying to register BlockManager

14/08/27 12:14:22 INFO BlockManagerInfo: Registering block manager 
10.94.74.159:51160 with

294.9 MB RAM

14/08/27 12:14:22 INFO BlockManagerMaster: Registered BlockManager

14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server

14/08/27 12:14:22 INFO HttpBroadcast: Broadcast server started at 
http://10.94.74.159:5116

1

14/08/27 12:14:22 INFO HttpFileServer: HTTP File server directory is 
C:\Users\D062844\AppD

ata\Local\Temp\spark-d79d2857-3d85-4b16-8d76-ade83d465f10

14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server

14/08/27 12:14:22 INFO SparkUI: Started SparkUI at http://10.94.74.159:4040

14/08/27 12:14:23 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your pla

tform... using builtin-java classes where applicable

14/08/27 12:14:23 INFO SparkContext: Added JAR 
file:/C:/Users/D062844/Desktop/HandsOnSpark

/Install/spark-1.0.2-bin-hadoop2/target/scala-2.10/simple-project_2.10-1.0.jar 
at http://1

0.94.74.159:51162/jars/simple-project_2.10-1.0.jar with timestamp 1409134463198

14/08/27 12:14:23 INFO MemoryStore: ensureFreeSpace(138763) called with 
curMem=0, maxMem=3

09225062

14/08/27 12:14:23 INFO MemoryStore: Block broadcast_0 stored as values to 
memory (estimate

d size 135.5 KB, free 294.8 MB)

14/08/27 12:14:23 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary p

ath

java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
Hadoop binar

ies.

at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)

at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)

at org.apache.hadoop.util.Shell.clinit(Shell.java:293)

at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)

at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362

)

at 
org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)

at 
org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)

at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)



at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)



at scala.Option.map(Option.scala:145)

at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:145)

at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)

at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)

at org.apache.spark.rdd.FilteredRDD.getPartitions(FilteredRDD.scala:29)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)



..

.





It goes on like this and doesn't show me the result of the count in the file. I had 
installed the pre-built Spark 1.0.2 (Hadoop 2) version from the site.

Replicate RDDs

2014-08-27 Thread rapelly kartheek
Hi,

I have a three-node Spark cluster. I restricted the resources per
application by setting appropriate parameters, and I could run two
applications simultaneously. Now I want to replicate an RDD and run two
applications simultaneously. Can someone help with how to go about doing this?
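
For reference, replication of a cached RDD is normally requested through one of the *_2 
storage levels. A minimal, untested sketch (the input path is made up and sc is assumed to 
be an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// Keep two replicas of each cached partition (that is what the *_2 levels mean)
val data = sc.textFile("hdfs:///path/to/input")
data.persist(StorageLevel.MEMORY_AND_DISK_2)
data.count()  // materializes the cache; the Storage tab should then list the replicas

Which executors end up holding the second copy is decided by the block managers at runtime.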

I replicated an RDD of size 1354 MB over this cluster. The web UI shows that
it's replicated twice. But when I go to the storage details, the two partitions,
each of size ~677MB, are stored on the same node. All other nodes do not
contain any partitions.

Can someone tell me where I am going wrong?

Thank you!!
-karthik


Re: Example File not running

2014-08-27 Thread Akhil Das
The statement java.io.IOException: Could not locate executable
null\bin\winutils.exe

indicates that null was received when expanding or replacing an
environment variable.

I'm guessing that you are missing *HADOOP_HOME* in the environment
variables.

Thanks
Best Regards



RE: Installation On Windows machine

2014-08-27 Thread Mishra, Abhishek
Thank you for the reply, Matei.

Is there something we missed? I am able to run the Spark instance on my local system, 
i.e. Windows 7, but the same set of steps does not allow me to run it on a Windows Server 
2012 machine. The black screen just appears for a fraction of a second and disappears, 
and I am unable to debug it. Please guide me.

Thanks,

Abhishek


-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com] 
Sent: Saturday, August 23, 2014 9:47 AM
To: Mishra, Abhishek
Cc: user@spark.apache.org
Subject: Re: Installation On Windows machine

You should be able to just download / unzip a Spark release and run it on a 
Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. 
The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on 
Windows, but you can launch a standalone cluster manually using

bin\spark-class org.apache.spark.deploy.master.Master

and

bin\spark-class org.apache.spark.deploy.worker.Worker spark://master:port

For submitting jobs to YARN instead of the standalone cluster, spark-submit.cmd 
*may* work but I don't think we've tested it heavily. If you find issues with 
that, please let us know. But overall the instructions should be the same as on 
Linux, except you use the .cmd scripts instead of the .sh ones.

Matei

On Aug 22, 2014, at 3:01 AM, Mishra, Abhishek abhishek.mis...@xerox.com wrote:

 Hello Team,
  
 I was just trying to install Spark on my Windows Server 2012 machine and use 
 it in my project, but unfortunately I cannot find any documentation for it. 
 Please let me know if we have drafted anything for Spark users on 
 Windows. I am really in need of it, as we are using Windows machines for Hadoop 
 and other tools and so cannot move back to a Linux OS or anything. We run 
 Hadoop on the Hortonworks HDP 2.0 platform, and recently I came across Spark 
 and so wanted to use it in my project for my analytics work. Please 
 suggest links or documents so that I can move ahead with my installation and 
 usage. I want to run it with Java.
  
 Looking forward for a reply,
  
 Thanking you in Advance,
 Sincerely,
 Abhishek
  
 






RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
What should I set as the value of that environment variable? I want to run the 
scripts locally on my machine and do not have Hadoop installed.



Thank you




External dependencies management with spark

2014-08-27 Thread Jaonary Rabarisoa
Dear all,

I'm looking for an efficient way to manage external dependencies. I know
that one can add .jar or .py dependencies easily, but how can I handle other
types of dependencies? Specifically, I have some data processing algorithms
implemented in other languages (Ruby, Octave, MATLAB, C++), and I need to
put all of these library components (sets of .rb, .m, .so files) on my
workers in order to be able to call them as external processes.
What is the best way to achieve this?

Cheers,


Jaonary
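
One common pattern for this (a sketch, not taken from this thread) is to ship arbitrary 
files with SparkContext.addFile and resolve their local path on the workers with 
SparkFiles; the file names and paths below are made up, and sc is assumed to be an 
existing SparkContext:

import org.apache.spark.SparkFiles

// Ship arbitrary files (scripts, shared libraries, .m files, ...) to every worker
sc.addFile("/local/path/helpers.rb")
sc.addFile("/local/path/libfeature.so")

val processed = sc.textFile("hdfs:///input").map { line =>
  // On a worker, SparkFiles.get returns the local path the file was copied to
  val script = SparkFiles.get("helpers.rb")
  // ... invoke the external tool with `script` as an argument ...
  line
}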


Re: Example File not running

2014-08-27 Thread Akhil Das
It should point to your Hadoop installation directory (like C:\hadoop\).

Since you don't have Hadoop installed, what is the code that you are
running?

Thanks
Best Regards



Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
Thank you Matei.

 I found a solution using pipe and the MATLAB engine (an executable that can
call MATLAB behind the scenes and uses stdin and stdout to communicate). I
just need to fix two other issues:

- How can I handle my dependencies? My MATLAB script needs other MATLAB
files that must be present on each worker's MATLAB path. So I need a way
to push them to each worker and tell MATLAB where to find them with
addpath. I know how to call addpath, but I don't know what the path should be.

- Does the pipe() operator work at the partition level, so that the
external process runs once per partition rather than once per element?
Initializing my external process costs a lot, so it is not good to call it
several times.
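
On the second point: pipe() launches the external command once per partition and streams 
that partition's elements through its stdin/stdout, so per-process initialization is paid 
once per partition rather than once per element. A rough, untested sketch of passing the 
MATLAB path as an environment variable (the script name, paths and variable value are 
assumptions, and sc is assumed to be an existing SparkContext):

// run_matlab_engine.sh is a hypothetical wrapper that starts the MATLAB engine
// and reads its input lines from stdin
val inputRdd = sc.textFile("hdfs:///features/input")
val results = inputRdd.pipe(
  "./run_matlab_engine.sh",                              // one external process per partition
  Map("MATLABPATH" -> "/path/to/shipped/matlab/helpers"))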



On Mon, Aug 25, 2014 at 9:03 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Have you tried the pipe() operator? It should work if you can launch your
 script from the command line. Just watch out for any environment variables
 needed (you can pass them to pipe() as an optional argument if there are
 some).

 On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa (jaon...@gmail.com)
 wrote:

 Hi all,

 Is there someone that tried to pipe RDD into matlab script ? I'm trying to
 do something similiar if one of you could point some hints.

 Best regards,

 Jao




RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
The code is the example given on the Spark site:



/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // Should be some file on your system
    val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}



It goes on like this and doesn't show me the result of the count in the file. I had 
installed the pre-built version of Spark 1.0.2, the one labelled for Hadoop 2, from the 
site. The problem, I think, is that I am running it on my local machine and it is not 
able to find some Hadoop dependencies. Please tell me which file I should download to 
work on my local machine (pre-built, so that I don't have to build it again).










NotSerializableException while doing rdd.saveToCassandra

2014-08-27 Thread lmk
Hi All,
I am using Spark 1.0.0 to parse a JSON file and save the values to Cassandra
using a case class.
My code looks as follows:
case class LogLine(x1:Option[String],x2:
Option[String],x3:Option[List[String]],x4:
Option[String],x5:Option[String],x6:Option[String],x7:Option[String],x8:Option[String],x9:Option[String])

val data = test.map(line =>
{
  parse(line)
}).map(json => {

  // Extract the values
  implicit lazy val formats = org.json4s.DefaultFormats

  val x1 = (json \ "x1").extractOpt[String]
  val x2 = (json \ "x2").extractOpt[String]
  val x4 = (json \ "x4").extractOpt[String]
  val x5 = (json \ "x5").extractOpt[String]
  val x6 = (json \ "x6").extractOpt[String]
  val x7 = (json \ "x7").extractOpt[String]
  val x8 = (json \ "x8").extractOpt[String]
  val x3 = (json \ "x3").extractOpt[List[String]]
  val x9 = (json \ "x9").extractOpt[String]

  LogLine(x1, x2, x3, x4, x5, x6, x7, x8, x9)
})

data.saveToCassandra("test", "test_data", Seq("x1", "x2", "x3", "x4", "x5",
"x6", "x7", "x8", "x9"))

whereas the cassandra table schema is as follows:
CREATE TABLE test_data (
x1 varchar,
x2 varchar,
x4 varchar,
x5 varchar,
x6 varchar,
x7 varchar,
x8 varchar,
x3 list<text>,
x9 varchar,
PRIMARY KEY (x1));

I am getting the following error on executing the saveToCassandra statement:

14/08/27 11:33:59 INFO SparkContext: Starting job: runJob at
package.scala:169
14/08/27 11:33:59 INFO DAGScheduler: Got job 5 (runJob at package.scala:169)
with 1 output partitions (allowLocal=false)
14/08/27 11:33:59 INFO DAGScheduler: Final stage: Stage 5(runJob at
package.scala:169)
14/08/27 11:33:59 INFO DAGScheduler: Parents of final stage: List()
14/08/27 11:33:59 INFO DAGScheduler: Missing parents: List()
14/08/27 11:33:59 INFO DAGScheduler: Submitting Stage 5 (MappedRDD[7] at map
at <console>:45), which has no missing parents
14/08/27 11:33:59 INFO DAGScheduler: Failed to run runJob at
package.scala:169
org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException: org.apache.spark.SparkConf

data.saveToCassandra("test", "test_data", Seq("x1", "x2", "x3", "x4", "x5",
"x6", "x7", "x8", "x9"))

Here the data field is org.apache.spark.rdd.RDD[LogLine] = MappedRDD[7] at
map at <console>:45

How can I convert this to Serializable, or is this a different problem?
Please advise.

Regards,
lmk







Example file not running

2014-08-27 Thread Hingorani, Vineet
Hello all,



I am able to use Spark in the shell, but I am not able to run a Spark file. I am 
using sbt and the jar is created, but even the SimpleApp class example given on 
the site http://spark.apache.org/docs/latest/quick-start.html is not running. I 
installed a prebuilt version of Spark, and sbt package compiles the Scala 
file to a jar. I am running it locally on my machine. The run goes on like this and 
doesn't show me the result of the count in the file. I had installed the pre-built 
Spark 1.0.2 version named for Hadoop 2 from the site. The problem, I think, is that 
I am running it on my local machine and it is not able to find some Hadoop 
dependencies. Please tell me which file I should download to work 
on my local machine (pre-built, so that I don't have to build it again).

The error log is huge, but it starts with something like this:



14/08/27 12:14:21 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/08/27 12:14:21 INFO SecurityManager: Changing view acls to: D062844
14/08/27 12:14:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(D062844)
14/08/27 12:14:22 INFO Slf4jLogger: Slf4jLogger started
14/08/27 12:14:22 INFO Remoting: Starting remoting
14/08/27 12:14:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.94.74.159:51157]
14/08/27 12:14:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.94.74.159:51157]
14/08/27 12:14:22 INFO SparkEnv: Registering MapOutputTracker
14/08/27 12:14:22 INFO SparkEnv: Registering BlockManagerMaster
14/08/27 12:14:22 INFO DiskBlockManager: Created local directory at C:\Users\D062844\AppData\Local\Temp\spark-local-20140827121422-dec8
14/08/27 12:14:22 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/08/27 12:14:22 INFO ConnectionManager: Bound socket to port 51160 with id = ConnectionManagerId(10.94.74.159,51160)
14/08/27 12:14:22 INFO BlockManagerMaster: Trying to register BlockManager
14/08/27 12:14:22 INFO BlockManagerInfo: Registering block manager 10.94.74.159:51160 with 294.9 MB RAM
14/08/27 12:14:22 INFO BlockManagerMaster: Registered BlockManager
14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server
14/08/27 12:14:22 INFO HttpBroadcast: Broadcast server started at http://10.94.74.159:51161
14/08/27 12:14:22 INFO HttpFileServer: HTTP File server directory is C:\Users\D062844\AppData\Local\Temp\spark-d79d2857-3d85-4b16-8d76-ade83d465f10
14/08/27 12:14:22 INFO HttpServer: Starting HTTP Server
14/08/27 12:14:22 INFO SparkUI: Started SparkUI at http://10.94.74.159:4040
14/08/27 12:14:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/27 12:14:23 INFO SparkContext: Added JAR file:/C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/target/scala-2.10/simple-project_2.10-1.0.jar at http://10.94.74.159:51162/jars/simple-project_2.10-1.0.jar with timestamp 1409134463198
14/08/27 12:14:23 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=309225062
14/08/27 12:14:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 294.8 MB)
14/08/27 12:14:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
        at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:145)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at

How to get prerelease thriftserver working?

2014-08-27 Thread Matt Chu
(apologies for sending this twice, first via nabble; didn't realize it
wouldn't get forwarded)

Hey, I know it's not officially released yet, but I'm trying to understand
(and run) the Thrift-based JDBC server, in order to enable remote JDBC
access to our dev cluster.

Before asking about details, is my understanding of this correct?
`sbin/start-thriftserver` is a JDBC/Hive server that doesn't require
running a Hive+MR cluster (i.e. just Spark/Spark+YARN)?

Assuming yes, I have hope that it all basically works, just that some
documentation needs to be cleaned up:

- I found a release page implying that 1.1 will be released pretty
soon-ish: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
- I can find recent activity (within the last 30 days or so) with promising
titles: [Updated Spark SQL README to include the hive-thriftserver
module](https://github.com/apache/spark/pull/1867), [[SPARK-2410][SQL]
Merging Hive Thrift/JDBC server (with Maven profile fix)](
https://github.com/apache/spark/pull/1620)

Am I following all the right email threads, issues trackers, and whatnot?

Specifically, I tried:

1. Building off of `branch-1.1`, synced as of ~today (2014 Aug 25)
2. Running `sbin/start-thriftserver.sh` in `yarn-client` mode
3. Can see the process running, and the spark context/app created in
the yarn logs,
and can connect to the thrift server on the default port of 10000 using
`bin/beeline`
4. However, when I try to find out what that cluster has via `show
tables;`, in the logs
I see a connection error to some (what I assume to be) random port.

So what service am I forgetting/too ignorant to run? Or did I misunderstand
and we do need a live Hive instance to back thriftserver? Or is this a
YARN-specific issue?

Only recently started learning the ecosystem and community, so apologies
for the longer post and lots of questions. :)

Matt


RE: Installation On Windows machine

2014-08-27 Thread Mishra, Abhishek
I got it working, Matei.
Thank you. I was giving the wrong directory path. Thank you...!!

Thanks,

Abhishek Mishra
-Original Message-
From: Mishra, Abhishek [mailto:abhishek.mis...@xerox.com] 
Sent: Wednesday, August 27, 2014 4:38 PM
To: Matei Zaharia
Cc: user@spark.apache.org
Subject: RE: Installation On Windows machine

Thank you for the reply Matei,

Is there something we missed? I am able to run the Spark instance on my local 
system, i.e. Windows 7, but the same set of steps does not allow me to run it 
on a Windows Server 2012 machine. The black screen just appears for a fraction 
of a second and disappears, and I am unable to debug it. Please guide me.

Thanks,

Abhishek


-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Saturday, August 23, 2014 9:47 AM
To: Mishra, Abhishek
Cc: user@spark.apache.org
Subject: Re: Installation On Windows machine

You should be able to just download / unzip a Spark release and run it on a 
Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. 
The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on 
Windows, but you can launch a standalone cluster manually using

bin\spark-class org.apache.spark.deploy.master.Master

and

bin\spark-class org.apache.spark.deploy.worker.Worker spark://master:port

For submitting jobs to YARN instead of the standalone cluster, spark-submit.cmd 
*may* work but I don't think we've tested it heavily. If you find issues with 
that, please let us know. But overall the instructions should be the same as on 
Linux, except you use the .cmd scripts instead of the .sh ones.

Matei

On Aug 22, 2014, at 3:01 AM, Mishra, Abhishek abhishek.mis...@xerox.com wrote:

 Hello Team,
  
 I was just trying to install spark on my windows server 2012 machine and use 
 it in my project; but unfortunately I do not find any documentation for the 
 same. Please let me know if we have drafted anything for spark users on 
 Windows. I am really in need of it as we are using Windows machine for Hadoop 
 and other tools and so cannot move back to Linux OS or anything. We run 
 Hadoop on hortonworks HDP2.0  platform and also recently I came across Spark 
 and so wanted use this even in my project for my Analytics work. Please 
 suggest me links or documents where I can move ahead with my installation and 
 usage. I want to run it on Java.
  
 Looking forward for a reply,
  
 Thanking you in Advance,
 Sincerely,
 Abhishek
  
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Example File not running

2014-08-27 Thread Akhil Das
You can install hadoop 2 by reading this doc
https://wiki.apache.org/hadoop/Hadoop2OnWindows Once you are done with it,
you can set the environment variable HADOOP_HOME then it should work.

Also, not sure if it will work, but can you add file:// at the front and
give it a go? I don't see any requirement for Hadoop here.

/* SimpleApp.scala */
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf

 object SimpleApp {
   def main(args: Array[String]) {
     val logFile = "file://C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system
     val conf = new SparkConf().setAppName("Simple Application")
     val sc = new SparkContext(conf)
     val logData = sc.textFile(logFile, 2).cache()
     val numAs = logData.filter(line => line.contains("a")).count()
     val numBs = logData.filter(line => line.contains("b")).count()
     println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
   }
 }




Thanks
Best Regards
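
If installing Hadoop 2 is not an option, a commonly used workaround is to point
hadoop.home.dir at a local folder that contains bin\winutils.exe before the
SparkContext is created. A hypothetical sketch (the C:\hadoop path and the presence
of a separately downloaded winutils.exe are assumptions, not part of the thread):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleAppOnWindows {
  def main(args: Array[String]) {
    // Same effect as setting the HADOOP_HOME environment variable for this JVM.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile("C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md", 2).cache()
    println("Lines with a: " + logData.filter(_.contains("a")).count())
    sc.stop()
  }
}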


On Wed, Aug 27, 2014 at 5:09 PM, Hingorani, Vineet vineet.hingor...@sap.com
 wrote:

  The code is the example given on Spark site:



 /* SimpleApp.scala */

 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf

 object SimpleApp {
   def main(args: Array[String]) {
     val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system
     val conf = new SparkConf().setAppName("Simple Application")
     val sc = new SparkContext(conf)
     val logData = sc.textFile(logFile, 2).cache()
     val numAs = logData.filter(line => line.contains("a")).count()
     val numBs = logData.filter(line => line.contains("b")).count()
     println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
   }
 }



 It goes on like this and doesn't show me the result of the count in the file.
 I had installed the pre-built Spark 1.0.2 for Hadoop 2 from the site. I think
 the problem is that I am running it on a local machine and it is not able to
 find some Hadoop dependencies. Please tell me which file I should download to
 work on my local machine (pre-built, so that I don't have to build it again).









 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Mittwoch, 27. August 2014 13:35
 *To:* Hingorani, Vineet
 *Cc:* user@spark.apache.org
 *Subject:* Re: Example File not running



 It should point to your hadoop installation directory. (like C:\hadoop\)



 Since you don't have hadoop installed, What is the code that you are
 running?


   Thanks

 Best Regards



 On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet 
 vineet.hingor...@sap.com wrote:

 What should I put the value of that environment variable? I want to run
 the scripts locally on my machine and do not have any Hadoop installed.



 Thank you



 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Mittwoch, 27. August 2014 12:54
 *To:* Hingorani, Vineet
 *Cc:* user@spark.apache.org
 *Subject:* Re: Example File not running



 The statement "java.io.IOException: Could not locate executable
 null\bin\winutils.exe"

 indicates that null was received when expanding or replacing an
 environment variable.



 I'm guessing that you are missing *HADOOP_HOME* in the environment
 variables.


   Thanks

 Best Regards



 On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet 
 vineet.hingor...@sap.com wrote:

 Hello all,



 I am able to use Spark in the shell but I am not able to run a spark file.
 I am using sbt and the jar is created but even the SimpleApp class example
 given on the site http://spark.apache.org/docs/latest/quick-start.html is
 not running. I installed a prebuilt version of spark and sbt package is
 compiling the scala file to jar. I am running it locally on my machine. The
 error log is huge but it starts with something like this:



 14/08/27 12:14:21 INFO SecurityManager: Using Spark's default log4j
 profile: org/apache/sp

 ark/log4j-defaults.properties

 14/08/27 12:14:21 INFO SecurityManager: Changing view acls to: D062844

 14/08/27 12:14:21 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls

 disabled; users with view permissions: Set(D062844)

 14/08/27 12:14:22 INFO Slf4jLogger: Slf4jLogger started

 14/08/27 12:14:22 INFO Remoting: Starting remoting

 14/08/27 12:14:22 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://spark@10.94.74.159:51157]

 14/08/27 12:14:22 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://spark@10.9

 4.74.159:51157]

 14/08/27 12:14:22 INFO SparkEnv: Registering MapOutputTracker

 14/08/27 12:14:22 INFO SparkEnv: Registering BlockManagerMaster

 14/08/27 12:14:22 INFO DiskBlockManager: Created local directory at
 C:\Users\D062844\AppDa

 ta\Local\Temp\spark-local-20140827121422-dec8

 14/08/27 12:14:22 INFO MemoryStore: MemoryStore started 

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
It didn't work after adding file:// at the front. I compiled it again and ran 
it, and the same errors are coming. Do you think there could be some problem with 
the Java dependency? Also, I don't want to install Hadoop; I just want to run it on 
my local machine. The reason is that whenever I install these things they don't run 
due to some dependencies, and then I have to spend time figuring out which 
dependencies are needed.



Thank you for helping but it is depressing that I am not even able to run a 
simple example with Spark.



Regards,

Vineet



From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Mittwoch, 27. August 2014 16:01
To: Hingorani, Vineet
Cc: user@spark.apache.org
Subject: Re: Example File not running



You can install hadoop 2 by reading this doc 
https://wiki.apache.org/hadoop/Hadoop2OnWindows Once you are done with it, you 
can set the environment variable HADOOP_HOME then it should work.



Also Not sure if it will work, but can you provide file:// at the front and 
give it a go? I don't see any requirement for hadoop here.



   /* SimpleApp.scala */
   import org.apache.spark.SparkContext
   import org.apache.spark.SparkContext._
   import org.apache.spark.SparkConf

   object SimpleApp {
     def main(args: Array[String]) {
       val logFile = "file://C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system
       val conf = new SparkConf().setAppName("Simple Application")
       val sc = new SparkContext(conf)
       val logData = sc.textFile(logFile, 2).cache()
       val numAs = logData.filter(line => line.contains("a")).count()
       val numBs = logData.filter(line => line.contains("b")).count()
       println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
     }
   }








   Thanks

   Best Regards



   On Wed, Aug 27, 2014 at 5:09 PM, Hingorani, Vineet 
vineet.hingor...@sap.com wrote:

   The code is the example given on Spark site:



   /* SimpleApp.scala */

   import org.apache.spark.SparkContext
   import org.apache.spark.SparkContext._
   import org.apache.spark.SparkConf

   object SimpleApp {
     def main(args: Array[String]) {
       val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system
       val conf = new SparkConf().setAppName("Simple Application")
       val sc = new SparkContext(conf)
       val logData = sc.textFile(logFile, 2).cache()
       val numAs = logData.filter(line => line.contains("a")).count()
       val numBs = logData.filter(line => line.contains("b")).count()
       println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
     }
   }



   It goes on like this and doesn't show me the result of the count in the file. I 
had installed the pre-built Spark 1.0.2 for Hadoop 2 from the site. I think the 
problem is that I am running it on a local machine and it is not able to find some 
Hadoop dependencies. Please tell me which file I should download to work on my 
local machine (pre-built, so that I don't have to build it again).









   From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
   Sent: Mittwoch, 27. August 2014 13:35
   To: Hingorani, Vineet
   Cc: user@spark.apache.org
   Subject: Re: Example File not running



   It should point to your hadoop installation directory. (like C:\hadoop\)



   Since you don't have hadoop installed, What is the code that you are running?




   Thanks

   Best Regards



   On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet 
vineet.hingor...@sap.com wrote:

   What should I put the value of that environment variable? I want to run the 
scripts locally on my machine and do not have any Hadoop installed.



   Thank you



   From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
   Sent: Mittwoch, 27. August 2014 12:54
   To: Hingorani, Vineet
   Cc: user@spark.apache.org
   Subject: Re: Example File not running



   The statement java.io.IOException: Could not locate executable 
null\bin\winutils.exe



   explains that the null is received when expanding or replacing an 
Environment Variable.



   I'm guessing that you are missing HADOOP_HOME in the environment variables.




   Thanks

   Best Regards



   On Wed, Aug 27, 2014 at 3:52 PM, Hingorani, Vineet 
vineet.hingor...@sap.com wrote:

   Hello all,



   I am able to use Spark in the shell but I am not able to run a spark file. I 
am using sbt and the jar is created but even the SimpleApp class example given 
on the site http://spark.apache.org/docs/latest/quick-start.html is not 
running. I installed a prebuilt version of spark and sbt package is compiling 
the scala file to 

Re: Does HiveContext support Parquet?

2014-08-27 Thread Silvio Fiorito
What Spark and Hadoop versions are you on? I have it working in my Spark
app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar.
I'm running Spark 1.0.2 and CDH5.

Can you try

bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar

to see if that works?

On 8/26/14, 6:03 PM, lyc yanchen@huawei.com wrote:

Hi Silvio,

I re-downloaded hive-0.12-bin and reset the related path in spark-env.sh.
However, I still got some error. Do you happen to know any step I did
wrong?
Thank you!

My detailed step is as follows:
#enter spark-shell (successful)
/bin/spark-shell --master spark://S4:7077 --jars
/home/hduser/parquet-hive-bundle-1.5.0.jar
#import related hiveContext (successful)
...
# create parquet table:
hql("CREATE TABLE parquet_test (id int, str string, mp MAP<STRING,STRING>,
lst ARRAY<STRING>, strct STRUCT<A:STRING,B:STRING>) PARTITIONED BY (part
string) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT
'parquet.hive.DeprecatedParquetOutputFormat'")

get error:

14/08/26 21:59:20 ERROR exec.DDLTask: java.lang.NoClassDefFoundError:
Could
not initialize class
org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetPrimitiveInspe
ctorFactory
   at
org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.ge
tObjectInspector(ArrayWritableObjectInspector.java:77)
   at
org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.i
nit(ArrayWritableObjectInspector.java:59)
   at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(Par
quetHiveSerDe.java:113)
   at
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreU
tils.java:218)
   at
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Tabl
e.java:272)
   at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:265)
   at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:597)
   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576)
   at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
   at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65
)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:186)
   at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:160)
   at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(Hive
Context.scala:250)
   at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.sca
la:247)
   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
   at $line34.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:18)
   at $line34.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:23)
   at $line34.$read$$iwC$$iwC$$iwC$$iwC.init(console:25)
   at $line34.$read$$iwC$$iwC$$iwC.init(console:27)
   at $line34.$read$$iwC$$iwC.init(console:29)
   at $line34.$read$$iwC.init(console:31)
   at $line34.$read.init(console:33)
   at $line34.$read$.init(console:37)
   at $line34.$read$.clinit(console)
   at $line34.$eval$.init(console:7)
   at $line34.$eval$.clinit(console)
   at $line34.$eval.$print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
57)
   at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm
pl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
   at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
   at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
   at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:84
1)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
   at

Saddle structure in Spark

2014-08-27 Thread LPG
Hello everyone,

Is it possible to use an external data structure, such as Saddle, in
Spark? As far as I know, an RDD is a kind of wrapper or container that has
a certain data structure inside. So I was wondering whether this data
structure has to be either a basic (or native) structure or any available
structure.

Thanks in advance fellows



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Saddle-structure-in-Spark-tp12916.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming Output to DB

2014-08-27 Thread Ravi Sharma
Thank you Akhil and Mayur.
It will be really helpful.

Thanks,
On 27 Aug 2014 13:19, Akhil Das ak...@sigmoidanalytics.com wrote:

 Like Mayur said, its better to use mapPartition instead of map.

 Here's a piece of code which reads a text file and inserts each
 row into the database. I haven't tested it; it might throw up some
 serialization errors, in which case you gotta serialize them!

 JavaRDD<String> txtRDD = jsc.textFile("/sigmoid/logs.cap");

 JavaRDD<String> dummyRDD = txtRDD.map(new Function<String, String>() {
   @Override
   public String call(String raw) throws Exception {

     // Creating JDBC Connection
     Class.forName("oracle.jdbc.driver.OracleDriver");
     Driver myDriver = new oracle.jdbc.driver.OracleDriver();
     DriverManager.registerDriver(myDriver);
     String URL = "jdbc:oracle:thin:AkhlD/password@akhldz:1521:EMP";
     Connection conn = DriverManager.getConnection(URL);
     Statement stmt = conn.createStatement();

     // Do some operations!! (executeUpdate, not executeQuery, for an INSERT)
     stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')");

     // Now close the connection
     conn.close();
     return raw + " [ Added ]";
   }
 });


 Thanks
 Best Regards
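
For reference, a rough sketch in Scala of the per-partition approach Mayur suggests
(quoted below): one JDBC connection per partition via foreachPartition, rather than
one per record. The URL, credentials and table are placeholders, and for a DStream
this body would sit inside foreachRDD.

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

def insertPartitions(txtRDD: RDD[String]): Unit = {
  txtRDD.foreachPartition { records =>
    // Opened once per partition, on the executor.
    val conn = DriverManager.getConnection("jdbc:oracle:thin:user/password@host:1521:EMP")
    val stmt = conn.createStatement()
    try {
      records.foreach { raw =>
        stmt.executeUpdate("INSERT INTO logs VALUES(null, '" + raw + "')")
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}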


 On Wed, Aug 27, 2014 at 10:06 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 I would suggest you to use JDBC connector in mappartition instead of maps
 as JDBC connections are costly  can really impact your performance.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Tue, Aug 26, 2014 at 6:45 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Yes, you can open a jdbc connection at the beginning of the map method
 then close this connection at the end of map() and in between you can use
 this connection.

 Thanks
 Best Regards


 On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma raviprincesha...@gmail.com
  wrote:

 Hello People,

  I'm using java spark streaming. I'm just wondering, Can I make simple
 jdbc connection in JavaDStream map() method?

 Or

 Do  I need to create jdbc connection for each JavaPairDStream, after
 map task?

 Kindly give your thoughts.


 Cheers,
 Ravi Sharma








Reference Accounts Large Node Deployments

2014-08-27 Thread Steve Nunez
All,

Does anyone have specific references to customers, use cases and large-scale
deployments of Spark Streaming? By 'large scale' I mean both throughput and
number of nodes. I'm attempting an objective comparison of Streaming and
Storm and while this data is known for Storm, there appears to be little for
Spark Streaming. If you know of any such deployments, please post them here
because I am sure I'm not the only one wondering about this. If customer
confidentiality prevents mentioning them by name, consider identifying them
by industry, e.g. 'telco doing X with streaming using Y nodes'.

Any information at all will be welcome. I'll feed back a summary and/or
update a wiki page once I collate the information.

Cheers,
- Steve






-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: Execute HiveFormSpark ERROR.

2014-08-27 Thread Du Li
As suggested in the error messages, double-check your class path.



From: CharlieLin chury...@gmail.com
Date: Tuesday, August 26, 2014 at 8:29 PM
To: user@spark.apache.org
Subject: Execute HiveFormSpark ERROR.

hi, all:
I tried to use Spark SQL in spark-shell, as in the Spark examples.
When I execute:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

then report error like below:

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
14/08/27 11:08:19 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS 
src (key INT, value STRING)
14/08/27 11:08:19 INFO ParseDriver: Parse Completed
14/08/27 11:08:19 INFO Analyzer: Max iterations (2) reached for batch 
MultiInstanceRelations
14/08/27 11:08:19 INFO Analyzer: Max iterations (2) reached for batch 
CaseInsensitiveAttributeReferences
14/08/27 11:08:19 INFO Analyzer: Max iterations (2) reached for batch Check 
Analysis
14/08/27 11:08:19 INFO SQLContext$$anon$1: Max iterations (2) reached for batch 
Add exchange
14/08/27 11:08:19 INFO SQLContext$$anon$1: Max iterations (2) reached for batch 
Prepare Expressions
14/08/27 11:08:19 INFO Driver: PERFLOG method=Driver.run
14/08/27 11:08:19 INFO Driver: PERFLOG method=TimeToSubmit
14/08/27 11:08:19 INFO Driver: PERFLOG method=compile
14/08/27 11:08:19 INFO Driver: PERFLOG method=parse
14/08/27 11:08:19 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS 
src (key INT, value STRING)
14/08/27 11:08:19 INFO ParseDriver: Parse Completed
14/08/27 11:08:19 INFO Driver: /PERFLOG method=parse start=1409108899822 
end=1409108899822 duration=0
14/08/27 11:08:19 INFO Driver: PERFLOG method=semanticAnalyze
14/08/27 11:08:19 INFO SemanticAnalyzer: Starting Semantic Analysis
14/08/27 11:08:19 INFO SemanticAnalyzer: Creating table src position=27
14/08/27 11:08:19 INFO HiveMetaStore: 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
14/08/27 11:08:19 INFO ObjectStore: ObjectStore, initialize called
14/08/27 11:08:20 WARN General: Plugin (Bundle) org.datanucleus is already 
registered. Ensure you dont have multiple JAR versions of the same plugin in 
the classpath. The URL 
file:/home/spark/spark/lib_managed/jars/datanucleus-core-3.2.2.jar is already 
registered, and you are trying to register an identical plugin located at URL 
file:/home/spark/spark-1.0.2-2.0.0-mr1-cdh-4.2.1/lib_managed/jars/datanucleus-core-3.2.2.jar.
14/08/27 11:08:20 WARN General: Plugin (Bundle) org.datanucleus.store.rdbms 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
file:/home/spark/spark-1.0.2-2.0.0-mr1-cdh-4.2.1/lib_managed/jars/datanucleus-rdbms-3.2.1.jar 
is already registered, and you are trying to register an identical plugin 
located at URL 
file:/home/spark/spark/lib_managed/jars/datanucleus-rdbms-3.2.1.jar.
14/08/27 11:08:20 WARN General: Plugin (Bundle) org.datanucleus.api.jdo is 
already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
file:/home/spark/spark/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar is 
already registered, and you are trying to register an identical plugin 
located at URL 
file:/home/spark/spark-1.0.2-2.0.0-mr1-cdh-4.2.1/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar.
14/08/27 11:08:20 INFO Persistence: Property datanucleus.cache.level2 unknown - 
will be ignored
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table src
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:189)
at 

Re: What is a Block Manager?

2014-08-27 Thread Victor Tso-Guillen
I have long-lived state I'd like to maintain on the executors, which I'd like
to initialize during some bootstrap phase, and I'd like to update the master
when such an executor leaves the cluster.
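
For what it's worth, a minimal sketch that uses the block-manager listener events
as a rough proxy for executors joining and leaving. The event and field names below
are my reading of the 1.x listener API and should be verified against the version
in use:

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded, SparkListenerBlockManagerRemoved}

class ExecutorTracker extends SparkListener {
  override def onBlockManagerAdded(added: SparkListenerBlockManagerAdded): Unit = {
    val id = added.blockManagerId
    println(s"Block manager (and likely executor) added: ${id.executorId} on ${id.host}")
    // Bootstrap long-lived per-executor state here.
  }

  override def onBlockManagerRemoved(removed: SparkListenerBlockManagerRemoved): Unit = {
    println(s"Block manager removed: ${removed.blockManagerId.executorId}")
    // Notify driver-side state that this executor is gone.
  }
}

// Registered on the driver: sc.addSparkListener(new ExecutorTracker())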


On Tue, Aug 26, 2014 at 11:18 PM, Liu, Raymond raymond@intel.com
wrote:

 The framework have those info to manage cluster status, and these info
 (e.g. worker number) is also available through spark metrics system.
 While from the user application's point of view, can you give an example
 why you need these info, what would you plan to do with them?

 Best Regards,
 Raymond Liu

 From: Victor Tso-Guillen [mailto:v...@paxata.com]
 Sent: Wednesday, August 27, 2014 1:40 PM
 To: Liu, Raymond
 Cc: user@spark.apache.org
 Subject: Re: What is a Block Manager?

 We're a single-app deployment so we want to launch as many executors as
 the system has workers. We accomplish this by not configuring the max for
 the application. However, is there really no way to inspect what
 machines/executor ids/number of workers/etc is available in context? I'd
 imagine that there'd be something in the SparkContext or in the listener,
 but all I see in the listener is block managers getting added and removed.
 Wouldn't one care about the workers getting added and removed at least as
 much as for block managers?

 On Tue, Aug 26, 2014 at 6:58 PM, Liu, Raymond raymond@intel.com
 wrote:
 Basically, a Block Manager manages the storage for most of the data in
 spark, name a few: block that represent a cached RDD partition,
 intermediate shuffle data, broadcast data etc. it is per executor, while in
 standalone mode, normally, you have one executor per worker.

 You don't control how many worker you have at runtime, but you can somehow
 manage how many executors your application will launch  Check different
 running mode's documentation for details  ( but control where? Hardly, yarn
 mode did some works based on data locality, but this is done by framework
 not user program).

 Best Regards,
 Raymond Liu

 From: Victor Tso-Guillen [mailto:v...@paxata.com]
 Sent: Tuesday, August 26, 2014 11:42 PM
 To: user@spark.apache.org
 Subject: What is a Block Manager?

 I'm curious not only about what they do, but what their relationship is to
 the rest of the system. I find that I get listener events for n block
 managers added where n is also the number of workers I have available to
 the application. Is this a stable constant?

 Also, are there ways to determine at runtime how many workers I have and
 where they are?

 Thanks,
 Victor




RE: Save an RDD to a SQL Database

2014-08-27 Thread bdev
I have similar requirement to export the data to mysql. Just wanted to know
what the best approach is so far after the research you guys have done.
Currently thinking of saving to hdfs and use sqoop to handle export. Is that
the best approach or is there any other way to write to mysql? Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p12921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Wei Tan
Thank you all. Actually I was looking at JCUDA. Function-wise this may be 
a perfect solution to offload computation to the GPU. Will see how the 
performance turns out, especially with the Java binding.

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Chen He airb...@gmail.com
To: Antonio Jesus Navarro ajnava...@stratio.com, 
Cc: Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, 
Wei Tan/Watson/IBM@IBMUS
Date:   08/27/2014 11:03 AM
Subject:Re: CUDA in spark, especially in MLlib?



JCUDA can let you do that in Java
http://www.jcuda.org


On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro 
ajnava...@stratio.com wrote:
Maybe this would interest you:

CPU and GPU-accelerated Machine Learning Library:

https://github.com/BIDData/BIDMach


2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com:

You should try to find a Java-based library, then you can call it from 
Scala.

Matei

On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote:
Hi I am trying to find a CUDA library in Scala, to see if some matrix 
manipulation in MLlib can be sped up.

I googled a few but found no active projects on Scala+CUDA. Python is 
supported by CUDA though. Any suggestion on whether this idea makes any 
sense?

Best regards,
Wei





Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang
I have the same issue (I'm using the latest 1.1.0-SNAPSHOT).

I've increased my driver memory to 30G, executor memory to 10G,
and spark.akka.askTimeout to 180. Still no good. My other configurations
are:

spark.serializer                          org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.mb            256
spark.shuffle.consolidateFiles            true
spark.shuffle.file.buffer.kb              400
spark.akka.frameSize                      500
spark.akka.timeout                        600
spark.akka.askTimeout                     180
spark.core.connection.auth.wait.timeout   300

However I just got informed that the YARN cluster I'm using *throttles the
resource for default queue. *Not sure if it's related.


Jianshi
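
For reference, the same settings from the list above expressed programmatically
on a SparkConf (a sketch; values copied verbatim from the list):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "256")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.shuffle.file.buffer.kb", "400")
  .set("spark.akka.frameSize", "500")
  .set("spark.akka.timeout", "600")
  .set("spark.akka.askTimeout", "180")
  .set("spark.core.connection.auth.wait.timeout", "300")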




On Wed, Aug 27, 2014 at 5:15 AM, Andrew Ash and...@andrewash.com wrote:

 Hi Grega,

 Did you ever get this figured out?  I'm observing the same issue in Spark
 1.0.2.

 For me it was after 1.5hr of a large .distinct call, followed by a
 .saveAsTextFile()

  14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got
 assigned task 18500
  14/08/26 20:57:43 INFO executor.Executor: Running task ID 18500
  14/08/26 20:57:43 INFO storage.BlockManager: Found block broadcast_0
 locally
  14/08/26 20:57:43 INFO spark.MapOutputTrackerWorker: Don't have map
 outputs for shuffle 0, fetching them
  14/08/26 20:58:13 ERROR executor.Executor: Exception in task ID 18491
  org.apache.spark.SparkException: Error communicating with MapOutputTracker
  at
 org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:108)
  at
 org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:155)
  at
 org.apache.spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:42)
  at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:65)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
  Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]
  at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at
 scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at
 org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:105)
  ... 23 more


 On Tue, Mar 11, 2014 at 3:07 PM, Grega Kespret gr...@celtra.com wrote:

  Your input data read as RDD may be causing OOM, so thats where you can
 use different memory configuration.

 We are not getting any OOM exceptions, just akka future timeouts in
 mapoutputtracker and unsuccessful get of shuffle outputs, therefore
 refetching them.

 What is the industry practice when going about debugging such errors?

 Questions:
 - why are mapoutputtrackers timing out? ( and how to debug this properly?)
 - what is the task/purpose of mapoutputtracker?
 - how to check per-task objects size?

 Thanks,
 Grega

 On 11 Mar 2014, at 18:43, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Shuffle data is always stored on disk, its unlikely to cause OOM. Your
 input data read as RDD may be causing OOM, so thats where you can use
 different memory configuration.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Tue, Mar 11, 2014 at 9:20 AM, sparrow do...@celtra.com wrote:

 I don't understand how exactly will that help. 

Re: Specifying classpath

2014-08-27 Thread Ashish Jain
I solved this issue by putting hbase-protobuf in Hadoop classpath, and not
in the spark classpath.

export HADOOP_CLASSPATH=/path/to/jar/hbase-protocol-0.98.1-cdh5.1.0.jar



On Tue, Aug 26, 2014 at 5:42 PM, Ashish Jain ashish@gmail.com wrote:

 Hello,

 I'm using the following version of Spark - 1.0.0+cdh5.1.0+41
 (1.cdh5.1.0.p0.27).

 I've tried to specify the libraries Spark uses using the following ways -

 1) Adding it to spark context
 2) Specifying the jar path in
   a) spark.executor.extraClassPath
   b) spark.executor.extraLibraryPath
 3) Copying the libraries to spark/lib
 4) Specifying the path in SPARK_CLASSPATH, even SPARK_LIBRARY_PATH
 5) Passing as --jars argument

 but the spark application is not able to pick up the libraries, although I
 can see the message SparkContext: Added JAR file. I get a
 NoClassDefFoundError.

 The only way I've been to able to make it work right now is by merging my
 application jar with all the library jars.

 What might be going on?

 I need it right now to specify hbase-protocol-0.98.1-cdh5.1.0.jar in
 SPARK_CLASSPATH as mentioned here
 https://issues.apache.org/jira/browse/HBASE-10877. I'm using spark-submit
 to submit the job

 Thanks
 Ashish




Re: Issue Connecting to HBase in spark shell

2014-08-27 Thread kpeng1
It looks like the issue I had is that I didn't pull in htrace-core jar into
the spark class path.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-Connecting-to-HBase-in-spark-shell-tp12855p12924.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Xiangrui Meng
Hi Wei,

Please keep us posted about the performance result you get. This would
be very helpful.

Best,
Xiangrui

On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote:
 Thank you all. Actually I was looking at JCUDA. Function wise this may be a
 perfect solution to offload computation to GPU. Will see how performance it
 will be, especially with the Java binding.

 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center
 http://researcher.ibm.com/person/us-wtan



 From:Chen He airb...@gmail.com
 To:Antonio Jesus Navarro ajnava...@stratio.com,
 Cc:Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org,
 Wei Tan/Watson/IBM@IBMUS
 Date:08/27/2014 11:03 AM
 Subject:Re: CUDA in spark, especially in MLlib?
 



 JCUDA can let you do that in Java
 http://www.jcuda.org


 On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro
 ajnava...@stratio.com wrote:
 Maybe this would interest you:

 CPU and GPU-accelerated Machine Learning Library:

 https://github.com/BIDData/BIDMach


 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com:

 You should try to find a Java-based library, then you can call it from
 Scala.

 Matei

 On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote:

 Hi I am trying to find a CUDA library in Scala, to see if some matrix
 manipulation in MLlib can be sped up.

 I googled a few but found no active projects on Scala+CUDA. Python is
 supported by CUDA though. Any suggestion on whether this idea makes any
 sense?

 Best regards,
 Wei




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1. doesn't work with hive context

2014-08-27 Thread S Malligarjunan
It was my mistake; somehow I had set the io.compression.codec property value 
to the above-mentioned class. The problem is resolved now.
 
Thanks and Regards,
Sankar S.  



On Wednesday, 27 August 2014, 1:23, S Malligarjunan smalligarju...@yahoo.com 
wrote:
 


Hello all,

I have just checked out branch-1.1 
and executed the commands below:

./bin/spark-shell --driver-memory 1G

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)

I am getting the following exception

Caused by: java.lang.IllegalArgumentException: Compression codec 
com.hadoop.compression.lzo.LzoCodec not found.
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.init(CompressionCodecFactory.java:175)
at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 72 more
Caused by: java.lang.ClassNotFoundException: Class 
com.hadoop.compression.lzo.LzoCodec not found

 
Thanks and Regards,
Sankar S.  

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Bharat Venkat
Hi Dibyendu,

That would be great.  One of the biggest drawbacks of Kafka utils as well as
your implementation is that I am unable to scale out processing.  I am
relatively new to Spark and Spark Streaming; from what I read and what I
observe with my deployment, the RDD created on one receiver is processed by
at most 2 nodes in my cluster (most likely because default replication is 2
and spark schedules processing close to where the data is).  I tried
rdd.replicate() to no avail.

Would Chris and your proposal to have union of DStreams for all these
Receivers still allow scaling out subsequent processing?

Thanks,
Bharat
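
For context, a hedged sketch of the unioned-receivers pattern under discussion,
with a repartition after the union so that downstream processing can spread beyond
the receiver nodes. Topic, ZooKeeper address, receiver count and partition count
are placeholders, not a recommended configuration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object UnionedKafkaStreams {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-union"), Seconds(10))

    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    }

    // One logical stream backed by several receivers; repartition to scale out processing.
    val unioned = ssc.union(streams).repartition(16)
    unioned.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}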


On Tue, Aug 26, 2014 at 10:59 PM, Dibyendu Bhattacharya 
dibyendu.bhattach...@gmail.com wrote:


 Thanks Chris and Bharat for your inputs. I agree, running multiple
 receivers/dstreams is desirable for scalability and fault tolerance, and
 this is easily doable. In the present KafkaReceiver I am creating as many
 threads as there are kafka topic partitions, but I can definitely create
 multiple KafkaReceivers, one for every partition. As Chris mentioned, in this
 case I then need to have a union of DStreams for all these Receivers. I will
 try this out and let you know.

 Dib


 On Wed, Aug 27, 2014 at 9:10 AM, Chris Fregly ch...@fregly.com wrote:

 great work, Dibyendu.  looks like this would be a popular contribution.

 expanding on bharat's question a bit:

 what happens if you submit multiple receivers to the cluster by creating
 and unioning multiple DStreams as in the kinesis example here:


 https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123

 for more context, the kinesis implementation above uses the Kinesis
 Client Library (KCL) to automatically assign - and load balance - stream
 shards among all KCL threads from all receivers (potentially coming and
 going as nodes die) on all executors/nodes using DynamoDB as the
 association data store.

 ZooKeeper would be used for your Kafka consumers, of course.  and
 ZooKeeper watches to handle the ephemeral nodes.  and I see you're using
 Curator, which makes things easier.

 as bharat suggested, running multiple receivers/dstreams may be desirable
 from a scalability and fault tolerance standpoint.  is this type of load
 balancing possible among your different Kafka consumers running in
 different ephemeral JVMs?

 and isn't it fun proposing a popular piece of code?  the question
 floodgates have opened!  haha. :)

  -chris



 On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya 
 dibyendu.bhattach...@gmail.com wrote:

 Hi Bharat,

 Thanks for your email. If the Kafka Reader worker process dies, it
 will be replaced by a different machine, and it will start consuming from the
 offset where it left off (for each partition). The same would happen even
 if I had an individual Receiver for every partition.

 Regards,
 Dibyendu


 On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat bvenkat.sp...@gmail.com
 wrote:

 I like this consumer for what it promises - better control over offset
 and
 recovery from failures.  If I understand this right, it still uses
 single
 worker process to read from Kafka (one thread per partition) - is there
 a
 way to specify multiple worker processes (on different machines) to read
 from Kafka?  Maybe one worker process for each partition?

 If there is no such option, what happens when the single machine
 hosting the
 Kafka Reader worker process dies and is replaced by a different
 machine
 (like in cloud)?

 Thanks,
 Bharat



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12788.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi All,
I am planning to run the amplab benchmark suite to evaluate the performance of our 
cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions 
data availability at:
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]
where /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able to 
download these datasets directly. Here is what I see. I read that they can be 
used directly by doing sc.textFile(s3:/). However, I wanted to make sure 
that my understanding is correct. Here is what I see at 
http://s3.amazonaws.com/big-data-benchmark/
I do not see anything for sequence or text-deflate.
I see the sequence-snappy dataset:
<Contents><Key>pavlo/sequence-snappy/5nodes/crawl/000738_0</Key><LastModified>2013-05-27T21:26:40.000Z</LastModified><ETag>a978d18721d5a533d38a88f558461644</ETag><Size>42958735</Size><StorageClass>STANDARD</StorageClass></Contents>
For text, I get the following error:
<Error><Code>NoSuchKey</Code><Message>The specified key does not 
exist.</Message><Key>pavlo/text/1node/crawl</Key><RequestId>166D239D38399526</RequestId><HostId>4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI</HostId></Error>

Please let me know if there is a way to readily download the dataset and view 
it. 

Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer,

I've faced this issue before. They don't show up on 
http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: 
`sc.textFile(s3n://big-data-benchmark/pavlo/text/tiny/crawl)`
The gotcha is that you also need to supply which dataset you want: crawl, 
uservisits, or rankings in lower case after the format and size you want them 
in.
They should be there.

Best,
Burak
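
A small sketch of reading one of those datasets along the lines suggested above.
The bucket and path follow the message; the credential property names are the
standard Hadoop s3n settings, and the key values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object ReadBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("amplab-benchmark"))

    // Only needed if credentials are not already supplied via the environment.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    val crawl = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
    println("Lines: " + crawl.count())
    sc.stop()
  }
}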

- Original Message -
From: Sameer Tilak ssti...@live.com
To: user@spark.apache.org
Sent: Wednesday, August 27, 2014 11:42:28 AM
Subject: Amplab: big-data-benchmark

Hi All,
I am planning to run amplab benchmark suite to evaluate the performance of our 
cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions 
about data avallability at:
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]where
 /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able to 
doanload these datasets directly. Here is what I see. I read that they can be 
used directly by doing : sc.textFile(s3:/). However, I wanted to make sure 
that my understanding is correct. Here is what I see at 
http://s3.amazonaws.com/big-data-benchmark/
I do not see anything for sequence or text-deflate.
I see sequence-snappy dataset:
ContentsKeypavlo/sequence-snappy/5nodes/crawl/000738_0/KeyLastModified2013-05-27T21:26:40.000Z/LastModifiedETaga978d18721d5a533d38a88f558461644/ETagSize42958735/SizeStorageClassSTANDARD/StorageClass/Contents
For text, I get the following error:
ErrorCodeNoSuchKey/CodeMessageThe specified key does not 
exist./MessageKeypavlo/text/1node/crawl/KeyRequestId166D239D38399526/RequestIdHostId4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI/HostId/Error

Please let me know if there is a way to readily download the dataset and view 
it. 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does HiveContext support Parquet?

2014-08-27 Thread lyc
Thanks a lot. Finally, I can create the parquet table using your
--driver-class-path suggestion.

I am using hadoop 2.3. Now, I will try to load data into the tables.

Thanks,
lyc
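
A hypothetical sketch of one way to load rows into the partitioned Parquet table
created earlier, assuming a staging Hive table with matching columns already
exists; the staging table name and partition value are placeholders:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

hql("""INSERT OVERWRITE TABLE parquet_test PARTITION (part='p1')
      SELECT id, str, mp, lst, strct FROM staging_table WHERE part = 'p1'""")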



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12931.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi Burak,Thanks,  I will then start benchmarking the cluster.

 Date: Wed, 27 Aug 2014 11:52:05 -0700
 From: bya...@stanford.edu
 To: ssti...@live.com
 CC: user@spark.apache.org
 Subject: Re: Amplab: big-data-benchmark
 
 Hi Sameer,
 
 I've faced this issue before. They don't show up on 
 http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: 
 `sc.textFile(s3n://big-data-benchmark/pavlo/text/tiny/crawl)`
 The gotcha is that you also need to supply which dataset you want: crawl, 
 uservisits, or rankings in lower case after the format and size you want them 
 in.
 They should be there.
 
 Best,
 Burak
 
 - Original Message -
 From: Sameer Tilak ssti...@live.com
 To: user@spark.apache.org
 Sent: Wednesday, August 27, 2014 11:42:28 AM
 Subject: Amplab: big-data-benchmark
 
 Hi All,
 I am planning to run amplab benchmark suite to evaluate the performance of 
 our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it 
 mentions about data avallability at:
 s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]where
  /tiny/, /1node/ and /5nodes/ are options for suffix. However, I am not able 
 to doanload these datasets directly. Here is what I see. I read that they can 
 be used directly by doing : sc.textFile(s3:/). However, I wanted to make 
 sure that my understanding is correct. Here is what I see at 
 http://s3.amazonaws.com/big-data-benchmark/
 I do not see anything for sequence or text-deflate.
 I see sequence-snappy dataset:
 ContentsKeypavlo/sequence-snappy/5nodes/crawl/000738_0/KeyLastModified2013-05-27T21:26:40.000Z/LastModifiedETaga978d18721d5a533d38a88f558461644/ETagSize42958735/SizeStorageClassSTANDARD/StorageClass/Contents
 For text, I get the following error:
 <Error><Code>NoSuchKey</Code><Message>The specified key does not 
 exist.</Message><Key>pavlo/text/1node/crawl</Key><RequestId>166D239D38399526</RequestId><HostId>4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI</HostId></Error>
 
 Please let me know if there is a way to readily download the dataset and view 
 it.   
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

Re: Does HiveContext support Parquet?

2014-08-27 Thread Michael Armbrust
I'll note the parquet jars are included by default in 1.1


On Wed, Aug 27, 2014 at 11:53 AM, lyc yanchen@huawei.com wrote:

 Thanks a lot. Finally, I can create a parquet table using your command with
 --driver-class-path.

 I am using hadoop 2.3. Now, I will try to load data into the tables.

 Thanks,
 lyc



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12931.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
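
For reference, a sketch of Parquet usage with HiveContext in Spark 1.0.x (paths and
table names below are placeholders, not from this thread). Spark SQL's native Parquet
support needs no extra jars; it is Hive-managed Parquet tables that needed the
--driver-class-path workaround before 1.1 started bundling the parquet jars:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  import hiveContext._

  // any SchemaRDD can be written out as Parquet and read back
  val src = hql("SELECT key, value FROM src")
  src.saveAsParquetFile("/tmp/src_parquet")
  val reloaded = hiveContext.parquetFile("/tmp/src_parquet")
  reloaded.registerAsTable("src_parquet")   // 1.0.x API; later renamed registerTempTable
  println(reloaded.count())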




Re: How to get prerelease thriftserver working?

2014-08-27 Thread Michael Armbrust
I would expect that to work.  What exactly is the error?


On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu m...@kabam.com wrote:

 (apologies for sending this twice, first via nabble; didn't realize it
 wouldn't get forwarded)

 Hey, I know it's not officially released yet, but I'm trying to understand
 (and run) the Thrift-based JDBC server, in order to enable remote JDBC
 access to our dev cluster.

 Before asking about details, is my understanding of this correct?
 `sbin/start-thriftserver` is a JDBC/Hive server that doesn't require
 running a Hive+MR cluster (i.e. just Spark/Spark+YARN)?

 Assuming yes, I have hope that it all basically works, just that some
 documentation needs to be cleaned up:

 - I found a release page implying that 1.1 will be released pretty
 soon-ish: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
- I can find recent (within the last 30 days or so) activity with promising
 titles: [Updated Spark SQL README to include the hive-thriftserver
 module](https://github.com/apache/spark/pull/1867), [[SPARK-2410][SQL]
 Merging Hive Thrift/JDBC server (with Maven profile fix)](
 https://github.com/apache/spark/pull/1620)

 Am I following all the right email threads, issues trackers, and whatnot?

 Specifically, I tried:

 1. Building off of `branch-1.1`, synced as of ~today (2014 Aug 25)
 2. Running `sbin/start-thriftserver.sh` in `yarn-client` mode
 3. Can see the processing running, and the spark context/app created in
 yarn logs,
 and can connect to the thrift server on the default port of 1 using
 `bin/beeline`
 4. However, when I try to find out what that cluster has via `show
 tables;`, in the logs
 I see a connection error to some (what I assume to be) random port.

 So what service am I forgetting/too ignorant to run? Or did I
 misunderstand and we do need a live Hive instance to back thriftserver? Or
 is this a YARN-specific issue?

 Only recently started learning the ecosystem and community, so apologies
 for the longer post and lots of questions. :)

 Matt



RE: Execution time increasing with increase of cluster size

2014-08-27 Thread Sameer Tilak
Can you tell which nodes were doing the computation in each case?

Date: Wed, 27 Aug 2014 20:29:38 +0530
Subject: Execution time increasing with increase of cluster size
From: sarathchandra.jos...@algofusiontech.com
To: user@spark.apache.org

Hi,
I've written a simple Scala program which reads a file on HDFS (a delimited file 
having 100 fields and 1 million rows), splits each row on the delimiter, computes 
the hashcode of each field, builds new rows from these hashcodes, and writes these 
rows back to HDFS. Code attached.

When I run this on a Spark cluster of 2 nodes (these 2 nodes also act as the HDFS 
cluster) it took about 35sec to complete. Then I increased the cluster to 4 
nodes (the additional nodes are not part of the HDFS cluster) and submitted the 
same job. I was expecting a decrease in the execution time, but instead it took 
about 3 times as long (1.6 min) to complete. Snapshots of the execution summary 
are attached.

Both times I've set executor memory to 6GB, which is available on all the 
nodes.
What am I missing here? Do I need to do any additional configuration when 
increasing the cluster size?

~Sarath
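
For reference, a rough sketch of the kind of job described above (the actual code
was attached to the original mail; delimiter, paths and field handling here are
placeholders):

  val input = sc.textFile("hdfs:///data/input.txt")
  val hashed = input.map { line =>
    line.split('|')                      // assumed delimiter
        .map(_.hashCode.toString)        // hashcode of each field
        .mkString("|")                   // rebuild the row
  }
  hashed.saveAsTextFile("hdfs:///data/output")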


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org 
  

Re: hive on spark yarn

2014-08-27 Thread Michael Armbrust
You need to have the datanucleus jars on your classpath.  It is not okay to
merge them into an uber jar.


On Wed, Aug 27, 2014 at 1:44 AM, centerqi hu cente...@gmail.com wrote:

 Hi all


 When I run a simple SQL statement, I encountered the following error.

 hive:0.12(metastore in mysql)

 hadoop 2.4.1

 spark 1.0.2 build with hive


 my hql code

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql._
 import org.apache.spark.sql.hive.LocalHiveContext

 object HqlTest {
   case class Record(key: Int, value: String)

   def main(args: Array[String]) {
     val sparkConf = new SparkConf().setAppName("HiveFromSpark")
     val sc = new SparkContext(sparkConf)
     val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
     import hiveContext._

     hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
   }
 }






 14/08/27 16:07:08 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 14/08/27 16:07:08 INFO ObjectStore: ObjectStore, initialize called
 org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table src
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313)
 at
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:189)
 at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:163)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
 at
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250)
 at
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:250)
 at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104)
 at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75)
 at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78)
 at HqlTest$.main(HqlTest.scala:15)
 at HqlTest.main(HqlTest.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
 Caused by: java.lang.RuntimeException: Unable to instantiate
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 at
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
 at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62)
 at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
 at
 org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372)
 at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
 ... 26 more
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1210)
 ... 31 more
 Caused by: javax.jdo.JDOFatalUserException: Class
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
 NestedThrowables:
 java.lang.ClassNotFoundException:
 org.datanucleus.api.jdo.JDOPersistenceManagerFactory
 at
 javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
 at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
 at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304)
 at
 

Spark N.C.

2014-08-27 Thread am
 Looking for fellow Spark enthusiasts based in and around Research Triangle 
Park, Raleigh, Durham, and Chapel Hill, North Carolina 

Please get in touch off list for an employment opportunity. Must be local. 
Thanks! 

-Andrew
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MLBase status

2014-08-27 Thread Sameer Tilak
Hi All,
I was wondering whether someone could please tell me the status of MLbase and its 
roadmap in terms of software releases. We are very interested in exploring it 
for our applications. 

[Streaming] Cannot get executors to stay alive

2014-08-27 Thread Yana Kadiyska
Hi, I asked a similar question before and didn't get any answers, so I'll
try again:

I am using updateStateByKey, pretty much exactly as shown in the examples
shipping with Spark:

 def createContext(master: String, dropDir: String, checkpointDirectory: String) = {
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val sparkConf = new SparkConf()
      .setMaster(master)
      .setAppName("StatefulNetworkWordCountNoMem")
      .setJars(StreamingContext.jarOfClass(this.getClass));

    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint(checkpointDirectory)

    val lines = ssc.textFileStream(dropDir)

    val wordDstream = lines.map(line => {
      val x = line.split("\t")
      ((x(0), x(1), x(2)), 1)
    })
    wordDstream.print()

    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    println("Printing stream")
    stateDstream.print()
    ssc
  }



I am running this over a _static_ drop directory (i.e. files are there at
the beginning of time, nothing new is coming).

The only difference is that I have a pretty wide key space -- about 440K,
each key is of length under 20 chars.

The code exactly as shown above runs for a while (about an hour) and then
the executors start dying with OOM exceptions. I tried executor memory from
512M to 2g (4 executors)-- the only difference is how long it takes for the
OOM. The only task that keeps executing is take at DStream (line
586), which makes sense. What doesn't make sense is why the memory isn't
getting reclaimed.

What am I doing wrong? I'd like to use streaming but it seems that I can't
get a process with 0 incoming traffic to stay up. Any advice much
appreciated -- I'm sure someone on this list has managed to run a streaming
program for longer than an hour!


Re: disable log4j for spark-shell

2014-08-27 Thread Yana
You just have to tell Spark which log4j properties file to use. I think
--driver-java-options=-Dlog4j.configuration=log4j.properties should work
but it didn't for me. set
SPARK_JAVA_OPTS=-Dlog4j.configuration=log4j.properties did work though (this
was on Windows, in local mode, assuming you put a file called
log4j.properties in the bin directory where you run the shell from). In
either case, like Sean said -Dlog4j.configuration=... is the magic
incantation, you just have to figure out how to pass it depending on what
version of SPARK you're using ( I personally find setting
PRINT_SPARK_LAUNCH_COMMAND=1 very very useful when I'm trying to track a
property)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p12942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Historic data and clocks

2014-08-27 Thread Frank van Lankvelt
Hi,

In an attempt to keep processing logic as simple as possible, I'm trying to
use spark streaming for processing historic as well as real-time data.
 This works quite well, using big intervals that match the window size for
historic data, and small intervals for real-time.

I found this discussion on a (very) similar situation:
https://groups.google.com/forum/#!topic/spark-users/ES8X1l_xn5s

To be able to do this historic data processing, I need access to the
ManualClock that's used by the job generator.  That's unfortunately only
accessible with reflection.

Since I couldn't find JIRA or GitHub issues, I wonder if there's perhaps a
better alternative solution, or if it would be worthwhile improving support
for manual time management?

cheers, Frank

-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com


Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Frank van Lankvelt
you could try looking at ScalaCL[1], it's targeting OpenCL rather than
CUDA, but that might be close enough?

cheers, Frank

1. https://github.com/ochafik/ScalaCL


On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote:

 Thank you all. Actually I was looking at JCUDA. Function-wise this may be
 a perfect solution to offload computation to the GPU. We'll see how the
 performance will be, especially with the Java binding.

 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center
 http://researcher.ibm.com/person/us-wtan



 From: Chen He airb...@gmail.com
 To: Antonio Jesus Navarro ajnava...@stratio.com,
 Cc: Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org,
 Wei Tan/Watson/IBM@IBMUS
 Date: 08/27/2014 11:03 AM
 Subject: Re: CUDA in spark, especially in MLlib?
 --



 JCUDA can let you do that in Java
 http://www.jcuda.org/


 On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro
 ajnava...@stratio.com wrote:
 Maybe this would interest you:

 CPU and GPU-accelerated Machine Learning Library:

 https://github.com/BIDData/BIDMach


 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com:

 You should try to find a Java-based library, then you can call it from
 Scala.

 Matei

 On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote:

 Hi I am trying to find a CUDA library in Scala, to see if some matrix
 manipulation in MLlib can be sped up.

 I googled a few but found no active projects on Scala+CUDA. Python is
 supported by CUDA though. Any suggestion on whether this idea makes any
 sense?

 Best regards,
 Wei






-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com


SchemaRDD

2014-08-27 Thread Koert Kuipers
i feel like SchemaRDD has usage beyond just sql. perhaps it belongs in core?


Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Hello,

If I do:

DStream transform {
rdd.zipWithIndex.map {

Is the index guaranteed to be unique across all RDDs here?

}
}

Thanks,
-Soumitra.


Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui

On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
 Hello,

 If I do:

 DStream transform {
 rdd.zipWithIndex.map {

 Is the index guaranteed to be unique across all RDDs here?

 }
 }

 Thanks,
 -Soumitra.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
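
A small sketch of what "indices start at 0 for every RDD" means in a streaming job
(a sketch only; lines is assumed to be a DStream[String]):

  val perBatchIndexed = lines.transform { rdd =>
    rdd.zipWithIndex.map { case (record, i) => (i, record) }
  }
  perBatchIndexed.print()   // keys 0, 1, 2, ... reappear in every batch interval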



minPartitions ignored for bz2?

2014-08-27 Thread jerryye
Hi,
I'm running on the master branch and I noticed that textFile ignores
minPartition for bz2 files. Is anyone else seeing the same thing? I tried
varying minPartitions for a bz2 file and rdd.partitions.size was always 1
whereas doing it for a non-bz2 file worked. 

Not sure if this matters or not but I'm reading from s3.

- jerry



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/minPartitions-ignored-for-bz2-tp12948.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
So, I guess zipWithUniqueId will be similar.

Is there a way to get unique index?


On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:

 No. The indices start at 0 for every RDD. -Xiangrui

 On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
 kumar.soumi...@gmail.com wrote:
  Hello,
 
  If I do:
 
  DStream transform {
  rdd.zipWithIndex.map {
 
  Is the index guaranteed to be unique across all RDDs here?
 
  }
  }
 
  Thanks,
  -Soumitra.



Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
You can use RDD id as the seed, which is unique in the same spark
context. Suppose none of the RDDs would contain more than 1 billion
records. Then you can use

rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

Just a hack ..

On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
 So, I guess zipWithUniqueId will be similar.

 Is there a way to get unique index?


 On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:

 No. The indices start at 0 for every RDD. -Xiangrui

 On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
 kumar.soumi...@gmail.com wrote:
  Hello,
 
  If I do:
 
  DStream transform {
  rdd.zipWithIndex.map {
 
  Is the index guaranteed to be unique across all RDDs here?
 
  }
  }
 
  Thanks,
  -Soumitra.



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: minPartitions ignored for bz2?

2014-08-27 Thread Xiangrui Meng
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files
before 1.2 (or a later version). But due to a bug
(https://issues.apache.org/jira/browse/HADOOP-10614), you should try
hadoop-2.5.0. -Xiangrui

On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote:
 Hi,
 I'm running on the master branch and I noticed that textFile ignores
 minPartition for bz2 files. Is anyone else seeing the same thing? I tried
 varying minPartitions for a bz2 file and rdd.partitions.size was always 1
 whereas doing it for a non-bz2 file worked.

 Not sure if this matters or not but I'm reading from s3.

 - jerry



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/minPartitions-ignored-for-bz2-tp12948.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
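
A minimal way to reproduce the behavior described above (the path is a placeholder;
as noted, the outcome depends on the Hadoop version in use):

  val rdd = sc.textFile("s3n://some-bucket/data.bz2", 8)   // request 8 partitions
  println(rdd.partitions.size)   // comes back as 1 for bz2 on Hadoop versions
                                 // without splittable bz2 support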



Re: SchemaRDD

2014-08-27 Thread Matei Zaharia
I think this will increasingly be its role, though it doesn't make sense to move it 
into core because it is clearly just a client of the core APIs. What usage do 
you have in mind in particular? It would be nice to know how the non-SQL APIs 
for this could be better.

Matei

On August 27, 2014 at 2:31:45 PM, Koert Kuipers (ko...@tresata.com) wrote:

i feel like SchemaRDD has usage beyond just sql. perhaps it belongs in core?
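
As a concrete illustration of the non-SQL usage being discussed (a sketch; table and
column names are placeholders): a SchemaRDD is an RDD[Row], so the core RDD API
applies to it directly.

  val people = hiveContext.hql("SELECT name, age FROM people")   // a SchemaRDD
  val adults = people.filter(row => row.getInt(1) >= 18)         // plain RDD[Row] API
  println(adults.count())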


Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Thanks.

Just to double check, rdd.id would be unique for a batch in a DStream?


On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote:

 You can use RDD id as the seed, which is unique in the same spark
 context. Suppose none of the RDDs would contain more than 1 billion
 records. Then you can use

 rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

 Just a hack ..

 On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
 kumar.soumi...@gmail.com wrote:
  So, I guess zipWithUniqueId will be similar.
 
  Is there a way to get unique index?
 
 
  On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
 
  No. The indices start at 0 for every RDD. -Xiangrui
 
  On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
  kumar.soumi...@gmail.com wrote:
   Hello,
  
   If I do:
  
   DStream transform {
   rdd.zipWithIndex.map {
  
   Is the index guaranteed to be unique across all RDDs here?
  
   }
   }
  
   Thanks,
   -Soumitra.
 
 



Re: How to get prerelease thriftserver working?

2014-08-27 Thread Cheng Lian
Hey Matt, if you want to access existing Hive data, you still need to run
a Hive metastore service, and provide a proper hive-site.xml (just drop it
in $SPARK_HOME/conf).

Could you provide the error log you saw?


On Wed, Aug 27, 2014 at 12:09 PM, Michael Armbrust mich...@databricks.com
wrote:

 I would expect that to work.  What exactly is the error?


 On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu m...@kabam.com wrote:

 (apologies for sending this twice, first via nabble; didn't realize it
 wouldn't get forwarded)

 Hey, I know it's not officially released yet, but I'm trying to
 understand (and run) the Thrift-based JDBC server, in order to enable
 remote JDBC access to our dev cluster.

 Before asking about details, is my understanding of this correct?
 `sbin/start-thriftserver` is a JDBC/Hive server that doesn't require
 running a Hive+MR cluster (i.e. just Spark/Spark+YARN)?

 Assuming yes, I have hope that it all basically works, just that some
 documentation needs to be cleaned up:

 - I found a release page implying that 1.1 will be released pretty
 soon-ish:
 https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
  - I can find recent (within the last 30 days or so) activity with promising
 titles: [Updated Spark SQL README to include the hive-thriftserver
 module](https://github.com/apache/spark/pull/1867), [[SPARK-2410][SQL]
 Merging Hive Thrift/JDBC server (with Maven profile fix)](
 https://github.com/apache/spark/pull/1620)

 Am I following all the right email threads, issues trackers, and whatnot?

 Specifically, I tried:

 1. Building off of `branch-1.1`, synced as of ~today (2014 Aug 25)
 2. Running `sbin/start-thriftserver.sh` in `yarn-client` mode
 3. Can see the processing running, and the spark context/app created in
 yarn logs,
 and can connect to the thrift server on the default port of 1 using
 `bin/beeline`
 4. However, when I try to find out what that cluster has via `show
 tables;`, in the logs
 I see a connection error to some (what I assume to be) random port.

 So what service am I forgetting/too ignorant to run? Or did I
 misunderstand and we do need a live Hive instance to back thriftserver? Or
 is this a YARN-specific issue?

 Only recently started learning the ecosystem and community, so apologies
 for the longer post and lots of questions. :)

 Matt





Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD.

On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
 Thanks.

 Just to double check, rdd.id would be unique for a batch in a DStream?


 On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote:

 You can use RDD id as the seed, which is unique in the same spark
 context. Suppose none of the RDDs would contain more than 1 billion
 records. Then you can use

 rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

 Just a hack ..

 On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
 kumar.soumi...@gmail.com wrote:
  So, I guess zipWithUniqueId will be similar.
 
  Is there a way to get unique index?
 
 
  On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
 
  No. The indices start at 0 for every RDD. -Xiangrui
 
  On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
  kumar.soumi...@gmail.com wrote:
   Hello,
  
   If I do:
  
   DStream transform {
   rdd.zipWithIndex.map {
  
   Is the index guaranteed to be unique across all RDDs here?
  
   }
   }
  
   Thanks,
   -Soumitra.
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Daniil Osipov
Hello,

I've been seeing the following errors when trying to save to S3:

Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure: Lost task
4058.3 in stage 2.1 (TID 12572, ip-10-81-151-40.ec2.internal):
java.io.FileNotFoundException:
/mnt/spark/spark-local-20140827191008-05ae/0c/shuffle_1_7570_5768 (No space left on
device)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.init(FileOutputStream.java:221)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175$

org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuff$
eWriter.scala:67)

org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuff$
eWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65$

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

DF tells me there is plenty of space left on the worker node:
[root@ip-10-81-151-40 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  4.6G  3.3G  59% /
tmpfs           7.4G     0  7.4G   0% /dev/shm
/dev/xvdb        37G   11G   25G  30% /mnt
/dev/xvdf        37G  9.5G   26G  27% /mnt2

Any suggestions?
Dan


Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
I see an issue here.

If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG.

I wish there were a DStream.mapPartitionsWithIndex.


On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote:

 You can use RDD id as the seed, which is unique in the same spark
 context. Suppose none of the RDDs would contain more than 1 billion
 records. Then you can use

 rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)

 Just a hack ..

 On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
 kumar.soumi...@gmail.com wrote:
  So, I guess zipWithUniqueId will be similar.
 
  Is there a way to get unique index?
 
 
  On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
 
  No. The indices start at 0 for every RDD. -Xiangrui
 
  On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
  kumar.soumi...@gmail.com wrote:
   Hello,
  
   If I do:
  
   DStream transform {
   rdd.zipWithIndex.map {
  
   Is the index guaranteed to be unique across all RDDs here?
  
   }
   }
  
   Thanks,
   -Soumitra.
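
One way to combine the two ideas in this thread -- per-RDD unique ids plus the RDD
id -- without the 1e9 multiplier is to pack the RDD id into the high bits inside
transform (a sketch only; it assumes the per-RDD ids stay below 2^40):

  val globallyIndexed = wordDstream.transform { rdd =>
    val batchId = rdd.id.toLong            // unique per RDD within a SparkContext
    rdd.zipWithUniqueId().map { case (rec, uid) =>
      (rec, (batchId << 40) | uid)         // high bits: rdd.id, low bits: per-RDD id
    }
  }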
 
 



SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
Hi, Michael.

I used HiveContext to create a table with a field of type Array. However, in 
the hql results, this field was returned as type ArrayBuffer which is mutable. 
Would it make more sense to be an Array?

The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext nor 
newer version of Spark yet.

Thanks,
Du




Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread Yana
I'm not so sure that your error is coming from the cassandra write.

you have val data = test.map(..).map(..)

so data will actually not get created until you try to save it. Can you try
to do something like data.count() or data.take(k) after this line and see if
you even get to the cassandra part? My suspicion is that you're trying to
access something (SparkConf?) within the map closures...
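
A sketch of the debugging step being suggested (parseLine/toRecord stand in for the
original map functions and are placeholders):

  val data = test.map(parseLine).map(toRecord)   // nothing executes yet: RDDs are lazy
  data.take(5).foreach(println)                  // forces the map closures to run
  // If the NotSerializableException already appears here, the problem is in the
  // closures (e.g. capturing SparkConf/SparkContext), not in the Cassandra write.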



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Cassandra-NotSerializable-Exception-while-saving-to-cassandra-tp12906p12960.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
Arrays in the JVM are also mutable.  However, you should not be relying on
the exact type here.  The only promise is that you will get back something
of type Seq[_].


On Wed, Aug 27, 2014 at 4:27 PM, Du Li l...@yahoo-inc.com wrote:

  Hi, Michael.

  I used HiveContext to create a table with a field of type Array.
 However, in the hql results, this field was returned as type ArrayBuffer
 which is mutable. Would it make more sense to be an Array?

  The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext
 nor newer version of Spark yet.

  Thanks,
 Du





Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
I found this discrepancy when writing unit tests for my project. Basically the 
expectation was that the returned type should match that of the input data. 
Although it’s easy to work around, I was just feeling a bit weird. Is there a 
better reason to return ArrayBuffer?


From: Michael Armbrust mich...@databricks.com
Date: Wednesday, August 27, 2014 at 5:21 PM
To: Du Li l...@yahoo-inc.com
Cc: user@spark.apache.org
Subject: Re: SparkSQL returns ArrayBuffer for fields of type Array

Arrays in the JVM are also mutable.  However, you should not be relying on the 
exact type here.  The only promise is that you will get back something of type 
Seq[_].


On Wed, Aug 27, 2014 at 4:27 PM, Du Li l...@yahoo-inc.com wrote:
Hi, Michael.

I used HiveContext to create a table with a field of type Array. However, in 
the hql results, this field was returned as type ArrayBuffer which is mutable. 
Would it make more sense to be an Array?

The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext nor 
newer version of Spark yet.

Thanks,
Du





Kafka stream receiver stops input

2014-08-27 Thread Tim Smith
Hi,

I have Spark (1.0.0 on CDH5) running with Kafka 0.8.1.1.

I have a streaming job that reads from a Kafka topic and writes
output to another Kafka topic. The job starts fine but after a while
the input stream stops getting any data. I think these messages show
no incoming data on the stream:
14/08/28 00:42:15 INFO ReceiverTracker: Stream 0 received 0 blocks

I run the job as:
spark-submit --class logStreamNormalizer --master yarn
log-stream-normalizer_2.10-1.0.jar --jars
spark-streaming-kafka_2.10-1.0.2.jar,kafka_2.10-0.8.1.1.jar,zkclient-0.jar,metrics-core-2.2.0.jar,json4s-jackson_2.10-3.2.10.jar
--executor-memory 6G --spark.cleaner.ttl 60 --executor-cores 4

As soon as I start the job, I see an error like:

14/08/28 00:50:59 INFO BlockManagerInfo: Added input-0-1409187056800
in memory on node6-acme.com:39418 (size: 83.3 MB, free: 3.1 GB)
Exception in thread pool-1-thread-7 java.lang.OutOfMemoryError: Java
heap space
at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:85)
at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
at 
org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:42)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

But not sure if that is the cause because even after that OOM message,
I see data coming in:
14/08/28 00:51:00 INFO ReceiverTracker: Stream 0 received 6 blocks

Appreciate any pointers or suggestions to troubleshoot the issue.

Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
In general the various language interfaces try to return the natural type
for the language.  In Python we return lists; in Scala we return Seqs.
 Arrays on the JVM have all sorts of messy semantics (e.g. they are
invariant and don't have erasure).


On Wed, Aug 27, 2014 at 5:34 PM, Du Li l...@yahoo-inc.com wrote:

   I found this discrepancy when writing unit tests for my project.
 Basically the expectation was that the returned type should match that of
 the input data. Although it’s easy to work around, I was just feeling a bit
 weird. Is there a better reason to return ArrayBuffer?


   From: Michael Armbrust mich...@databricks.com
 Date: Wednesday, August 27, 2014 at 5:21 PM
 To: Du Li l...@yahoo-inc.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: SparkSQL returns ArrayBuffer for fields of type Array

   Arrays in the JVM are also mutable.  However, you should not be relying
 on the exact type here.  The only promise is that you will get back
 something of type Seq[_].


 On Wed, Aug 27, 2014 at 4:27 PM, Du Li l...@yahoo-inc.com wrote:

  Hi, Michael.

  I used HiveContext to create a table with a field of type Array.
 However, in the hql results, this field was returned as type ArrayBuffer
 which is mutable. Would it make more sense to be an Array?

  The Spark version of my test is 1.0.2. I haven’t tested it on
 SQLContext nor newer version of Spark yet.

  Thanks,
 Du
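
For the unit-test concern above, a sketch of a comparison that does not depend on
the concrete Seq implementation that comes back (table and column are hypothetical):

  val expected = Seq(1, 2, 3)
  val returned = hql("SELECT arr FROM src_array LIMIT 1").first()(0).asInstanceOf[Seq[Int]]
  assert(returned == expected)   // Seq equality holds whether it is an ArrayBuffer or a List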






Re: MLBase status

2014-08-27 Thread Ameet Talwalkar
Hi Sameer,

MLbase started out as a set of three ML components on top of Spark.  The
lowest level, MLlib, is now a rapidly growing component within Spark and is
maintained by the Spark community. The two higher-level components (MLI and
MLOpt) are experimental components that serve as testbeds for MLlib.

A prototype of MLI was released last fall, and since then many ideas from
this prototype and the associated paper
http://www.cs.berkeley.edu/~ameet/mlinterface_icdm.pdf have been (or are
in the process of being) ported to MLlib.

MLOpt, which ultimately aims to automate the task of ML pipeline
construction, is an active area of research in the AMPLab.  We are still
exploring various research ideas, and given the nature of research, it's
not clear when we will be ready to release any code.

For more details check out mlbase.org, my talk
https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
at the Spark Summit about MLlib (slides 5, 6, 33, 34 describe MLbase), and
Evan's slides
http://spark-summit.org/wp-content/uploads/2014/07/Ghostface-Towards-an-Optimizer-for-MLBase-Sparks-Talkwalkar-Franklin-Jordan-Kraska1.pdf
on the current status of MLOpt.

-Ameet


On Wed, Aug 27, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote:

 Hi All,
 I was wondering can someone please tell me the status of MLbase and its
 roadmap in terms of software release. We are very interested in exploring
 it for our applications.



Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Dibyendu Bhattacharya
I agree. This issue should be fixed in Spark rather than relying on replay of Kafka
messages.

Dib
On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote:

 Dibyendu,

 Tnks for getting back.

 I believe you are absolutely right. We were under the assumption that the
 raw data was being computed again and that's not happening after further
 tests. This applies to Kafka as well.

 The issue is of major priority fortunately.

 Regarding your suggestion, I would prefer to have the problem resolved
 within Spark's internals, since once the data is replicated we should be
 able to access it once more rather than having to pull it back again from
 Kafka or any other stream affected by this issue. If, for example, there is
 a large number of batches to be recomputed, I would rather have them done
 in a distributed way than overload the batch interval with a huge amount of
 Kafka messages.

 I do not yet have enough know-how about where the issue is or about the
 internal Spark code, so I can't really say how difficult the implementation
 will be.

 tnks,
 Rod



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12966.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi,

I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 
0.98, 

My steps:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2

edit project/SparkBuild.scala, set HBASE_VERSION
  // HBase version; set as appropriate.
  val HBASE_VERSION = "0.98.2"


edit pom.xml with following values
<hadoop.version>2.4.1</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<yarn.version>${hadoop.version}</yarn.version>
<hbase.version>0.98.5</hbase.version>
<zookeeper.version>3.4.6</zookeeper.version>
<hive.version>0.13.1</hive.version>


SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I 
set HBASE_VERSION back to “0.94.6”?

Regards
Arthur




[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.hbase#hbase;0.98.2: not found
[warn]  ::

sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not 
found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:125)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
[error] (examples/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.hbase#hbase;0.98.2: not found
[error] 

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
See SPARK-1297

The pull request is here:
https://github.com/apache/spark/pull/1893


On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 (correction: Compilation Error:  Spark 1.0.2 with HBase 0.98” , please
 ignore if duplicated)


 Hi,

 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with
 HBase 0.98,

 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2

 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
    val HBASE_VERSION = "0.98.2"


 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>


 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or
 should I set HBASE_VERSION back to “0.94.6?

 Regards
 Arthur




 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::

 sbt.ResolveException: unresolved dependency:
 org.apache.hbase#hbase;0.98.2: not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at
 xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at sbt.IvySbt$Module.withModule(Ivy.scala:116)
 at sbt.IvyActions$.update(IvyActions.scala:125)
 at
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
 at
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
 at
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
 at
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
 at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
 at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
 at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
 at
 sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
 at sbt.std.Transform$$anon$4.work(System.scala:64)
 at
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
 at sbt.Execute.work(Execute.scala:244)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at
 sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
 at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted,

Thank you so much!!

As I am new to Spark, can you please advise the steps about how to apply this 
patch to my spark-1.0.2 source folder?

Regards
Arthur


On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: Compilation Error:  Spark 1.0.2 with HBase 0.98” , please 
 ignore if duplicated)
 
 
 Hi,
 
 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
 HBase 0.98,
 
 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 
 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
    val HBASE_VERSION = "0.98.2"
 
 
 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>
 
 
 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2
 
 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I 
 set HBASE_VERSION back to “0.94.6?
 
 Regards
 Arthur
 
 
 
 
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::
 
 sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
 not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at 
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at 
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at sbt.IvySbt$Module.withModule(Ivy.scala:116)
 at sbt.IvyActions$.update(IvyActions.scala:125)
 at 
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
 at 
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
 at 
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
 at 
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
 at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
 at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
 at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
 at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
 at sbt.std.Transform$$anon$4.work(System.scala:64)
 at 
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at 
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
 at sbt.Execute.work(Execute.scala:244)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at 
 sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
 at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at 

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Henry Hung
Update:

I use a shell script to execute spark-shell; inside my-script.sh:
$SPARK_HOME/bin/spark-shell < $HOME/test.scala > $HOME/test.log 2>&1 &

Although it correctly finishes the println("hallo world"), the strange thing 
is that my-script.sh finishes before spark-shell even finishes executing the 
script.

Best regards,
Henry Hung

From: MA33 YTHung1
Sent: Thursday, August 28, 2014 10:01 AM
To: user@spark.apache.org
Subject: how to correctly run scala script using spark-shell through stdin 
(spark v1.0.0)

HI All,

Right now I'm trying to execute a script using this command:

nohup $SPARK_HOME/bin/spark-shell < $HOME/my-script.scala > $HOME/my-script.log 
2>&1 &

my-script.scala just have 1 line of code:
println("hallo world")

But after waiting for a minute, I still don't receive the result from 
spark-shell.
And the output log only contains:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data1/hadoop/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data1/hadoop/hbase-0.96.0-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/28 09:38:27 INFO spark.SecurityManager: Changing view acls to: hadoop
14/08/28 09:38:27 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/08/28 09:38:27 INFO spark.HttpServer: Starting HTTP Server
14/08/28 09:38:27 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/08/28 09:38:27 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:49382

My question is:
What is the right way to execute spark script?
I really want to use spark-shell as a way to run cron job in the future, just 
like when I run R script.

Best regards,
Henry


The privileged confidential information contained in this email is intended for 
use only by the addressees as indicated by the original sender of this email. 
If you are not the addressee indicated in this email or are not responsible for 
delivery of the email to such a person, please kindly reply to the sender 
indicating this fact and delete all copies of it from your computer and network 
server immediately. Your cooperation is highly appreciated. It is advised that 
any unauthorized use of confidential information of Winbond is strictly 
prohibited; and any information in this email irrelevant to the official 
business of Winbond shall be deemed as neither given nor endorsed by Winbond.




Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
You can get the patch from this URL:
https://github.com/apache/spark/pull/1893.patch

BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml

Cheers
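
For background: HBase 0.96+ no longer publishes a monolithic hbase artifact, which
is why the resolution fails; the patch above moves the build to the split modules,
roughly along these lines (sbt syntax, versions as appropriate):

  libraryDependencies ++= Seq(
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-client" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop2"
  )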


On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thank you so much!!

 As I am new to Spark, can you please advise the steps about how to apply
 this patch to my spark-1.0.2 source folder?

 Regards
 Arthur


 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-1297

 The pull request is here:
 https://github.com/apache/spark/pull/1893


 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please
 ignore if duplicated)


 Hi,

 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with
 HBase 0.98,

 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2

 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"


 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>


 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or
 should I set HBASE_VERSION back to "0.94.6"?

 Regards
 Arthur




 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::

 sbt.ResolveException: unresolved dependency:
 org.apache.hbase#hbase;0.98.2: not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at
 xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at sbt.IvySbt$Module.withModule(Ivy.scala:116)
 at sbt.IvyActions$.update(IvyActions.scala:125)
 at
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
 at
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
 at
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
 at
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
 at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
 at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
 at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
 at
 sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
 at sbt.std.Transform$$anon$4.work(System.scala:64)
 at
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
 at sbt.Execute.work(Execute.scala:244)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at
 

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted,

I tried the following steps to apply the patch 1893 but got Hunk FAILED, can 
you please advise how to get thru this error? or is my spark-1.0.2 source not 
the correct one?

Regards
Arthur
 
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2
wget https://github.com/apache/spark/pull/1893.patch
patch < 1893.patch
patching file pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
patching file pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 FAILED at 171.
3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
can't find file to patch at input line 267
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--
|
|From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
|From: tedyu yuzhih...@gmail.com
|Date: Mon, 11 Aug 2014 15:57:46 -0700
|Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
| description to building-with-maven.md
|
|---
| docs/building-with-maven.md | 3 +++
| 1 file changed, 3 insertions(+)
|
|diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- a/docs/building-with-maven.md
|+++ b/docs/building-with-maven.md
--
File to patch:



On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:

 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch
 
 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 Thank you so much!!
 
 As I am new to Spark, can you please advise the steps about how to apply this 
 patch to my spark-1.0.2 source folder?
 
 Regards
 Arthur
 
 
 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:
 
 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please 
 ignore if duplicated)
 
 
 Hi,
 
 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
 HBase 0.98,
 
 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 
 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"
 
 
 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>
 
 
 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2
 
 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should 
 I set HBASE_VERSION back to "0.94.6"?
 
 Regards
 Arthur
 
 
 
 
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::
 
 sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
 not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at 
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at 
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at sbt.IvySbt$Module.withModule(Ivy.scala:116)
 at sbt.IvyActions$.update(IvyActions.scala:125)
 at 
 

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
Can you use this command ?

patch -p1 -i 1893.patch

Cheers


On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 I tried the following steps to apply the patch 1893 but got Hunk FAILED,
 can you please advise how to get thru this error? or is my spark-1.0.2
 source not the correct one?

 Regards
 Arthur

 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:



 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:

 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch

 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the
 pom.xml

 Cheers


 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thank you so much!!

 As I am new to Spark, can you please advise the steps about how to apply
 this patch to my spark-1.0.2 source folder?

 Regards
 Arthur


 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-1297

  The pull request is here:
 https://github.com/apache/spark/pull/1893


 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

  (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please
  ignore if duplicated)


 Hi,

 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2
 with HBase 0.98,

 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2

 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"


  edit pom.xml with following values
  <hadoop.version>2.4.1</hadoop.version>
  <protobuf.version>2.5.0</protobuf.version>
  <yarn.version>${hadoop.version}</yarn.version>
  <hbase.version>0.98.5</hbase.version>
  <zookeeper.version>3.4.6</zookeeper.version>
  <hive.version>0.13.1</hive.version>


 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

  Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or
  should I set HBASE_VERSION back to "0.94.6"?

 Regards
 Arthur




 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::

 sbt.ResolveException: unresolved dependency:
 org.apache.hbase#hbase;0.98.2: not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at
 xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at 

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, 

Thanks. 

Tried [patch -p1 -i 1893.patch], but got "Hunk #1 FAILED at 45."
Is this normal?

Regards
Arthur


patch -p1 -i 1893.patch
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file examples/pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 succeeded at 122 (offset -49 lines).
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 succeeded at 122 (offset -40 lines).
Hunk #2 succeeded at 195 (offset -40 lines).


On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote:

 Can you use this command ?
 
 patch -p1 -i 1893.patch
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 I tried the following steps to apply the patch 1893 but got Hunk FAILED, can 
 you please advise how to get thru this error? or is my spark-1.0.2 source not 
 the correct one?
 
 Regards
 Arthur
  
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:
 
 
 
 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:
 
 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch
 
 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 Thank you so much!!
 
 As I am new to Spark, can you please advise the steps about how to apply 
 this patch to my spark-1.0.2 source folder?
 
 Regards
 Arthur
 
 
 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:
 
 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please 
 ignore if duplicated)
 
 
 Hi,
 
 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
 HBase 0.98,
 
 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 
 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"
 
 
 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>
 
 
 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2
 
 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should 
 I set HBASE_VERSION back to "0.94.6"?
 
 Regards
 Arthur
 
 
 
 
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::
 
 sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
 not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at 

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Matei Zaharia
You can use spark-shell -i file.scala to run that. However, that keeps the 
interpreter open at the end, so you need to make your file end with 
System.exit(0) (or even more robustly, do stuff in a try {} and add that in 
finally {}).
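
For example, a minimal my-script.scala along those lines (the computation itself is just 
a placeholder) might be:

// Run with: spark-shell -i my-script.scala
// `sc` is the SparkContext that spark-shell already provides.
try {
  val n = sc.parallelize(1 to 100).map(_ * 2).count()
  println("hallo world, count = " + n)
} finally {
  // Without this the interpreter stays open and the script never returns.
  System.exit(0)
}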

In general it would be better to compile apps and run them with spark-submit. 
The Scala shell isn't that easy for debugging, etc.
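
As a rough sketch of that route (the class name, jar name and toy job below are made up), 
a standalone app would look something like this:

import org.apache.spark.{SparkConf, SparkContext}

// Build this into a jar, then run it with something like:
//   spark-submit --class HalloWorld my-app.jar
object HalloWorld {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HalloWorld"))
    try {
      println("hallo world, count = " + sc.parallelize(1 to 100).count())
    } finally {
      sc.stop()
    }
  }
}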

Matei

On August 27, 2014 at 7:23:10 PM, Henry Hung (ythu...@winbond.com) wrote:

Update:

 

I use a shell script to execute spark-shell; inside my-script.sh:

$SPARK_HOME/bin/spark-shell < $HOME/test.scala > $HOME/test.log 2>&1 &

 

Although it correctly finishes the println("hallo world"), the strange thing 
is that my-script.sh finished before spark-shell even finished executing the 
script.

 

Best regards,

Henry Hung

 

From: MA33 YTHung1
Sent: Thursday, August 28, 2014 10:01 AM
To: user@spark.apache.org
Subject: how to correctly run scala script using spark-shell through stdin 
(spark v1.0.0)

 

HI All,

 

Right now I’m trying to execute a script using this command:

 

nohup $SPARK_HOME/bin/spark-shell < $HOME/my-script.scala > $HOME/my-script.log 2>&1 &

 

my-script.scala just has one line of code:

println("hallo world")

 

But after waiting for a minute, I still don’t receive the result from 
spark-shell.

And the output log only contains:

 

Spark assembly has been built with Hive, including Datanucleus jars on classpath

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in 
[jar:file:/data1/hadoop/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in 
[jar:file:/data1/hadoop/hbase-0.96.0-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

14/08/28 09:38:27 INFO spark.SecurityManager: Changing view acls to: hadoop

14/08/28 09:38:27 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop)

14/08/28 09:38:27 INFO spark.HttpServer: Starting HTTP Server

14/08/28 09:38:27 INFO server.Server: jetty-8.y.z-SNAPSHOT

14/08/28 09:38:27 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:49382

 

My question is:

What is the right way to execute a Spark script?

I really want to use spark-shell as a way to run cron jobs in the future, just 
like when I run R scripts.

 

Best regards,

Henry

 

The privileged confidential information contained in this email is intended for 
use only by the addressees as indicated by the original sender of this email. 
If you are not the addressee indicated in this email or are not responsible for 
delivery of the email to such a person, please kindly reply to the sender 
indicating this fact and delete all copies of it from your computer and network 
server immediately. Your cooperation is highly appreciated. It is advised that 
any unauthorized use of confidential information of Winbond is strictly 
prohibited; and any information in this email irrelevant to the official 
business of Winbond shall be deemed as neither given nor endorsed by Winbond.


Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi,

I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1.

I just tried to compile Spark 1.0.2, but got error on Spark Project Hive, can 
you please advise which repository has 
org.spark-project.hive:hive-metastore:jar:0.13.1?


FYI, below is my repository setting in maven which would be old:
<repository>
   <id>nexus</id>
   <name>local private nexus</name>
   <url>http://maven.oschina.net/content/groups/public/</url>
   <releases>
     <enabled>true</enabled>
   </releases>
   <snapshots>
     <enabled>false</enabled>
   </snapshots>
</repository>


export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package

[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.892s]
[INFO] Spark Project Core  SUCCESS [1:33.698s]
[INFO] Spark Project Bagel ... SUCCESS [12.270s]
[INFO] Spark Project GraphX .. SUCCESS [2:16.343s]
[INFO] Spark Project ML Library .. SUCCESS [4:18.495s]
[INFO] Spark Project Streaming ... SUCCESS [39.765s]
[INFO] Spark Project Tools ... SUCCESS [9.173s]
[INFO] Spark Project Catalyst  SUCCESS [35.462s]
[INFO] Spark Project SQL . SUCCESS [1:16.118s]
[INFO] Spark Project Hive  FAILURE [1:36.816s]
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Parent POM . SKIPPED
[INFO] Spark Project YARN Stable API . SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 12:40.616s
[INFO] Finished at: Thu Aug 28 11:41:07 HKT 2014
[INFO] Final Memory: 37M/826M
[INFO] 
[ERROR] Failed to execute goal on project spark-hive_2.10: Could not resolve 
dependencies for project org.apache.spark:spark-hive_2.10:jar:1.0.2: The 
following artifacts could not be resolved: 
org.spark-project.hive:hive-metastore:jar:0.13.1, 
org.spark-project.hive:hive-exec:jar:0.13.1, 
org.spark-project.hive:hive-serde:jar:0.13.1: Could not find artifact 
org.spark-project.hive:hive-metastore:jar:0.13.1 in nexus-osc 
(http://maven.oschina.net/content/groups/public/) - [Help 1]


Regards
Arthur



Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
Looks like the patch given by that URL only had the last commit.

I have attached pom.xml for spark-1.0.2 to SPARK-1297
You can download it and replace examples/pom.xml with the downloaded pom

I am running this command locally:

mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package

Cheers


On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thanks.

  Tried [patch -p1 -i 1893.patch], but got "Hunk #1 FAILED at 45."
  Is this normal?

 Regards
 Arthur


 patch -p1 -i 1893.patch
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 succeeded at 94 (offset -16 lines).
 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file examples/pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 succeeded at 122 (offset -49 lines).
 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 succeeded at 122 (offset -40 lines).
 Hunk #2 succeeded at 195 (offset -40 lines).


 On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote:

 Can you use this command ?

 patch -p1 -i 1893.patch

 Cheers


 On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 I tried the following steps to apply the patch 1893 but got Hunk FAILED,
 can you please advise how to get thru this error? or is my spark-1.0.2
 source not the correct one?

 Regards
 Arthur

 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:



 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:

 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch

 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the
 pom.xml

 Cheers


 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thank you so much!!

 As I am new to Spark, can you please advise the steps about how to apply
 this patch to my spark-1.0.2 source folder?

 Regards
 Arthur


 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-1297

  The pull request is here:
 https://github.com/apache/spark/pull/1893


 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

  (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please
  ignore if duplicated)


 Hi,

 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2
 with HBase 0.98,

 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2

 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"


  edit pom.xml with following values
  <hadoop.version>2.4.1</hadoop.version>
  <protobuf.version>2.5.0</protobuf.version>
  <yarn.version>${hadoop.version}</yarn.version>
  <hbase.version>0.98.5</hbase.version>
  <zookeeper.version>3.4.6</zookeeper.version>
  <hive.version>0.13.1</hive.version>


 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

  Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or
  should I set HBASE_VERSION back to "0.94.6"?

 Regards
 Arthur




 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::

 sbt.ResolveException: unresolved dependency:
 org.apache.hbase#hbase;0.98.2: not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at 

Re: Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread Ted Yu
See this thread:
http://search-hadoop.com/m/JW1q5wwgyL1/Working+Formula+for+Hive+0.13subj=Re+Working+Formula+for+Hive+0+13+


On Wed, Aug 27, 2014 at 8:54 PM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,

 I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1.

 I just tried to compile Spark 1.0.2, but got error on Spark Project
 Hive, can you please advise which repository has 
 org.spark-project.hive:hive-metastore:jar:0.13.1?


 FYI, below is my repository setting in maven which would be old:
 <repository>
   <id>nexus</id>
   <name>local private nexus</name>
   <url>http://maven.oschina.net/content/groups/public/</url>
   <releases>
     <enabled>true</enabled>
   </releases>
   <snapshots>
     <enabled>false</enabled>
   </snapshots>
 </repository>


 export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package

 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. SUCCESS [1.892s]
 [INFO] Spark Project Core  SUCCESS
 [1:33.698s]
 [INFO] Spark Project Bagel ... SUCCESS
 [12.270s]
 [INFO] Spark Project GraphX .. SUCCESS
 [2:16.343s]
 [INFO] Spark Project ML Library .. SUCCESS
 [4:18.495s]
 [INFO] Spark Project Streaming ... SUCCESS
 [39.765s]
 [INFO] Spark Project Tools ... SUCCESS [9.173s]
 [INFO] Spark Project Catalyst  SUCCESS
 [35.462s]
 [INFO] Spark Project SQL . SUCCESS
 [1:16.118s]
 [INFO] Spark Project Hive  FAILURE
 [1:36.816s]
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project YARN Parent POM . SKIPPED
 [INFO] Spark Project YARN Stable API . SKIPPED
 [INFO] Spark Project Assembly  SKIPPED
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 12:40.616s
 [INFO] Finished at: Thu Aug 28 11:41:07 HKT 2014
 [INFO] Final Memory: 37M/826M
 [INFO]
 
 [ERROR] Failed to execute goal on project spark-hive_2.10: Could not
 resolve dependencies for project
 org.apache.spark:spark-hive_2.10:jar:1.0.2: The following artifacts could
 not be resolved: org.spark-project.hive:hive-metastore:jar:0.13.1,
 org.spark-project.hive:hive-exec:jar:0.13.1,
 org.spark-project.hive:hive-serde:jar:0.13.1: Could not find artifact
 org.spark-project.hive:hive-metastore:jar:0.13.1 in nexus-osc (
 http://maven.oschina.net/content/groups/public/) - [Help 1]


 Regards
 Arthur




Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread lmk
Hi Yana
I have done a take and confirmed the existence of data. I also checked that it is
getting connected to Cassandra. That is why I suspect that this particular
RDD is not serializable.
Thanks,
Lmk
On Aug 28, 2014 5:13 AM, Yana [via Apache Spark User List] 
ml-node+s1001560n12960...@n3.nabble.com wrote:

 I'm not so sure that your error is coming from the cassandra write.

 you have val data = test.map(..).map(..)

 so data will actually not get created until you try to save it. Can you
 try to do something like data.count() or data.take(k) after this line and
 see if you even get to the cassandra part? My suspicion is that you're
 trying to access something (SparkConf?) within the map closures...
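
A minimal sketch of the pattern being described here (all class and value names are 
invented, and the Cassandra write is left out):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for whatever is used inside the map closures.
class RowBuilder {                      // note: not Serializable
  def build(i: Int): String = "row-" + i
}

object ClosureCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("closure-check").setMaster("local[2]"))

    // BAD: `builder` lives on the driver and is captured by the closure, so the
    // job fails with "Task not serializable" / NotSerializableException:
    //   val builder = new RowBuilder
    //   val data = sc.parallelize(1 to 10).map(i => builder.build(i))

    // Better: create the non-serializable object inside the task instead.
    val data = sc.parallelize(1 to 10).mapPartitions { it =>
      val builder = new RowBuilder
      it.map(builder.build)
    }

    // Forcing an action here, as suggested above, shows whether the maps themselves
    // are the problem before any save is attempted.
    data.take(3).foreach(println)
    sc.stop()
  }
}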






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Cassandra-NotSerializable-Exception-while-saving-to-cassandra-tp12906p12984.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
I forgot to include '-Dhadoop.version=2.4.1' in the command below.

The modified command passed.

You can verify the dependency on HBase 0.98 through this command:

mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt

Cheers


On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looks like the patch given by that URL only had the last commit.

 I have attached pom.xml for spark-1.0.2 to SPARK-1297
 You can download it and replace examples/pom.xml with the downloaded pom

 I am running this command locally:

 mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package

 Cheers


 On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thanks.

  Tried [patch -p1 -i 1893.patch], but got "Hunk #1 FAILED at 45."
  Is this normal?

 Regards
 Arthur


 patch -p1 -i 1893.patch
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 succeeded at 94 (offset -16 lines).
 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file examples/pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 succeeded at 122 (offset -49 lines).
 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 succeeded at 122 (offset -40 lines).
 Hunk #2 succeeded at 195 (offset -40 lines).


 On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote:

 Can you use this command ?

 patch -p1 -i 1893.patch

 Cheers


 On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 I tried the following steps to apply the patch 1893 but got Hunk FAILED,
 can you please advise how to get thru this error? or is my spark-1.0.2
 source not the correct one?

 Regards
 Arthur

 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:



 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:

 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch

 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the
 pom.xml

 Cheers


 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thank you so much!!

 As I am new to Spark, can you please advise the steps about how to
 apply this patch to my spark-1.0.2 source folder?

 Regards
 Arthur


 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-1297

  The pull request is here:
 https://github.com/apache/spark/pull/1893


 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

  (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98",
  please ignore if duplicated)


 Hi,

 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2
 with HBase 0.98,

 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2

 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"


  edit pom.xml with following values
  <hadoop.version>2.4.1</hadoop.version>
  <protobuf.version>2.5.0</protobuf.version>
  <yarn.version>${hadoop.version}</yarn.version>
  <hbase.version>0.98.5</hbase.version>
  <zookeeper.version>3.4.6</zookeeper.version>
  <hive.version>0.13.1</hive.version>


 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

  Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or
  should I set HBASE_VERSION back to "0.94.6"?

 Regards
 Arthur




 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: 

Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
Hi,
We have migrated Pig functionality on top of Spark, passing 100% of the e2e
success cases in the Pig test suite. That means UDFs, joins and other
functionality are working quite nicely. We are in the process of merging
with the Apache Pig trunk (something that should happen over the next 2 weeks).
Meanwhile, if you are interested in giving it a go, you can try it at
https://github.com/sigmoidanalytics/spork
This contains all the major changes but may not have all the patches
required for 100% e2e; if you are trying it out, let me know of any issues you
face.

Whole bunch of folks contributed on this

Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid
Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga
(Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket Mokashi
 (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics),
Mahesh Kalakoti (Sigmoid Analytics)

Not to mention the Spark & Pig communities.

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together.

Matei

On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) 
wrote:

Hi,
We have migrated Pig functionality on top of Spark, passing 100% of the e2e success 
cases in the Pig test suite. That means UDFs, joins and other functionality are working 
quite nicely. We are in the process of merging with the Apache Pig trunk (something 
that should happen over the next 2 weeks). 
Meanwhile, if you are interested in giving it a go, you can try it at 
https://github.com/sigmoidanalytics/spork
This contains all the major changes but may not have all the patches required 
for 100% e2e; if you are trying it out, let me know of any issues you face.

Whole bunch of folks contributed on this 

Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid 
Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga 
(Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket Mokashi  
(Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh 
Kalakoti (Sigmoid Analytics)

Not to mention the Spark & Pig communities. 

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



Submitting multiple files pyspark

2014-08-27 Thread Chengi Liu
Hi,
  I have two files:

main_app.py and helper.py
main_app.py calls some functions in helper.py.
I want to use spark-submit to submit a job, but how do I specify helper.py?
Basically, how do I specify multiple files in Spark?
Thanks