Re: combineByKey throws ClassCastException

2014-09-14 Thread x
How about this.

scala> val rdd2 = rdd.combineByKey(
 | (v: Int) => v.toLong,
 | (c: Long, v: Int) => c + v,
 | (c1: Long, c2: Long) => c1 + c2)
rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at combineByKey at <console>:14
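
To get the per-key average that the original code was after, the same Long-based combiners can be paired with a running count; the following is only a sketch (not from the original reply, and untested on that exact setup):

// In a standalone app, pair-RDD operations also need: import org.apache.spark.SparkContext._
val sums = rdd.combineByKey(
  (v: Int) => v.toLong,
  (c: Long, v: Int) => c + v,
  (c1: Long, c2: Long) => c1 + c2)

val counts = rdd.combineByKey(
  (_: Int) => 1L,
  (c: Long, _: Int) => c + 1L,
  (c1: Long, c2: Long) => c1 + c2)

// Join the two (String, Long) RDDs and divide sum by count per key.
val avg = sums.join(counts).mapValues { case (sum, count) => sum.toDouble / count }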

xj @ Tokyo

On Mon, Sep 15, 2014 at 3:06 PM, Tao Xiao  wrote:

> I followed an example presented in the tutorial Learning Spark
> 
> to compute the per-key average as follows:
>
>
> val Array(appName) = args
> val sparkConf = new SparkConf()
> .setAppName(appName)
> val sc = new SparkContext(sparkConf)
> /*
>  * compute the per-key average of values
>  * results should be:
>  *A : 5.8
>  *B : 14
>  *C : 60.6
>  */
> val rdd = sc.parallelize(List(
> ("A", 3), ("A", 9), ("A", 12), ("A", 0), ("A", 5),
> ("B", 4), ("B", 10), ("B", 11), ("B", 20), ("B", 25),
> ("C", 32), ("C", 91), ("C", 122), ("C", 3), ("C", 55)), 2)
> val avg = rdd.combineByKey(
>     (x: Int) => (x, 1),  // java.lang.ClassCastException: scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer
>     (acc: (Int, Int), x) => (acc._1 + x, acc._2 + 1),
>     (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
>   .map { case (s, t) => (s, t._1 / t._2.toFloat) }
> avg.collect.foreach(t => println(t._1 + " -> " + t._2))
>
>
>
> When I submitted the application, the exception "*java.lang.ClassCastException:
> scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer*" was thrown.
> The tutorial said that the first function of *combineByKey*, *(x:Int)
> => (x, 1)*, should take a single element in the source RDD and return an
> element of the desired type in the resulting RDD. In my application, we
> take a single element of type *Int *from the source RDD and return a
> tuple of type (*Int*, *Int*), which meets the requirements quite well.
> But why would such an exception be thrown?
>
> I'm using CDH 5.0 and Spark 0.9
>
> Thanks.
>
>
>


SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-14 Thread linkpatrickliu
I started sparkSQL thrift server:
"sbin/start-thriftserver.sh"

Then I use beeline to connect to it:
"bin/beeline"
"!connect jdbc:hive2://localhost:1 op1 op1"

I have created a database for user op1.
"create database dw_op1";

And grant all privileges to user op1;
"grant all on database dw_op1 to user op1";

Then I create a table:
"create tabel src(key int, value string)"

Now, I want to load data into this table:
"load data inpath "kv1.txt" into table src"; (kv1.txt is located in the
/user/op1 directory in hdfs)

However, the client will hang...

The log in the thrift server:
14/09/15 14:21:25 INFO Driver: 


Then I Ctrl-C to stop the beeline client, and restart the beeline client.
Now I want to drop the table src in dw_op1;
"use dw_op1"
"drop table src"

Then, the beeline client is hanging again..
The log in the thrift server:
14/09/15 14:23:27 INFO Driver: 


Can anyone help with this? Many thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Accuracy hit in classification with Spark

2014-09-14 Thread jatinpreet
Hi,

I have been able to get the same accuracy with MLlib as with Mahout. Mahout's
pre-processing phase was the reason behind the accuracy mismatch.
After studying it and applying the same logic in my code, it worked like a
charm.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p14221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




combineByKey throws ClassCastException

2014-09-14 Thread Tao Xiao
I followed an example presented in the tutorial Learning Spark

to compute the per-key average as follows:


val Array(appName) = args
val sparkConf = new SparkConf()
.setAppName(appName)
val sc = new SparkContext(sparkConf)
/*
 * compute the per-key average of values
 * results should be:
 *A : 5.8
 *B : 14
 *C : 60.6
 */
val rdd = sc.parallelize(List(
("A", 3), ("A", 9), ("A", 12), ("A", 0), ("A", 5),
("B", 4), ("B", 10), ("B", 11), ("B", 20), ("B", 25),
("C", 32), ("C", 91), ("C", 122), ("C", 3), ("C", 55)), 2)
val avg = rdd.combineByKey(
    (x: Int) => (x, 1),  // java.lang.ClassCastException: scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer
    (acc: (Int, Int), x) => (acc._1 + x, acc._2 + 1),
    (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
  .map { case (s, t) => (s, t._1 / t._2.toFloat) }
avg.collect.foreach(t => println(t._1 + " -> " + t._2))



When I submitted the application, the exception
"*java.lang.ClassCastException:
scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer*" was thrown.
The tutorial said that the first function of *combineByKey*, *(x:Int) =>
(x, 1)*, should take a single element in the source RDD and return an
element of the desired type in the resulting RDD. In my application, we
take a single element of type *Int *from the source RDD and return a tuple
of type (*Int*, *Int*), which meets the requirements quite well. But why
would such an exception be thrown?

I'm using CDH 5.0 and Spark 0.9

Thanks.


Re: PathFilter for newAPIHadoopFile?

2014-09-14 Thread Nat Padmanabhan
Hi Eric,

Something along the lines of the following should work

val fs = getFileSystem(...) // standard hadoop API call
// pathFilter is an instance of org.apache.hadoop.fs.PathFilter
val filteredConcatenatedPaths =
  fs.listStatus(topLevelDirPath, pathFilter).map(_.getPath.toString).mkString(",")
val parquetRdd = sc.hadoopFile(filteredConcatenatedPaths,
  classOf[ParquetInputFormat[Something]], classOf[Void],
  classOf[SomeAvroType], getConfiguration(...))

You have to do some initializations on ParquetInputFormat such as
AvroReadSetup/AvroWriteSupport etc but that you should be doing
already I am guessing.
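
Another route that may be worth a try (a sketch only, not something verified here; the key/value/input-format classes in the commented call are placeholders): Hadoop's FileInputFormat lets you register a PathFilter in the Job configuration that you hand to newAPIHadoopFile.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Skip administrative files such as ".foo" and "_foo".
class AdminFileFilter extends PathFilter {
  override def accept(p: Path): Boolean =
    !p.getName.startsWith(".") && !p.getName.startsWith("_")
}

val job = Job.getInstance(new Configuration())
FileInputFormat.setInputPathFilter(job, classOf[AdminFileFilter])

// SomeInputFormat / SomeKey / SomeValue stand in for the real parquet+avro classes.
// val rdd = sc.newAPIHadoopFile("/path/to/data",
//   classOf[SomeInputFormat], classOf[SomeKey], classOf[SomeValue],
//   job.getConfiguration)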

Cheers,
Nat


On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
 wrote:
> Hi,
>
> I have a directory structure with parquet+avro data in it. There are a
> couple of administrative files (.foo and/or _foo) that I need to ignore when
> processing this data or Spark tries to read them as containing parquet
> content, which they do not.
>
> How can I set a PathFilter on the FileInputFormat used to construct an RDD?




Re: Use Case of mutable RDD - any ideas around will help.

2014-09-14 Thread Evan Chan
SPARK-1671 looks really promising.

Note that even right now, you don't need to un-cache the existing
table.   You can do something like this:

newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

When you use "table", it will return you the cached representation, so
that the union executes much faster.

However, there is some unknown slowdown; it's not quite as fast as
you would expect.

On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian  wrote:
> Ah, I see. So basically what you need is something like cache write-through
> support, which exists in Shark but is not implemented in Spark SQL yet. In
> Shark, when inserting data into a table that has already been cached, the
> newly inserted data will be automatically cached and “union”-ed with the
> existing table content. SPARK-1671 was created to track this feature. We’ll
> work on that.
>
> Currently, as a workaround, instead of doing the union at the RDD level, you may
> try caching the new table, unioning it with the old table, and then querying the
> unioned table. The drawback is higher code complexity and you end up with
> lots of temporary tables, but the performance should be reasonable.
>
>
> On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur 
> wrote:
>>
>> Little code snippet:
>>
>> line1: cacheTable(existingRDDTableName)
>> line2: //some operations which will materialize existingRDD dataset.
>> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
>> line4: cacheTable(new_existingRDDTableName)
>> line5: //some operation that will materialize new _existingRDD.
>>
>> Now, what we expect is that in line 4, rather than caching both
>> existingRDDTableName and new_existingRDDTableName, it should cache only
>> new_existingRDDTableName. But we cannot explicitly uncache
>> existingRDDTableName because we want the union to use the cached
>> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName could be
>> materialized later, and by then we can't lose existingRDDTableName from the cache.
>>
>> What if we keep the same name for the new table?
>>
>> so, cacheTable(existingRDDTableName)
>> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
>> cacheTable(existingRDDTableName) //might not be needed again.
>>
>> Will both our cases be satisfied, i.e. that it uses existingRDDTableName from
>> the cache for the union and doesn't duplicate the data in the cache, but somehow
>> appends to the older cached table?
>>
>> Thanks and Regards,
>>
>>
>> Archit Thakur.
>> Sr Software Developer,
>> Guavus, Inc.
>>
>> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
>>  wrote:
>>>
>>> I think I should elaborate on the use case a little more.
>>>
>>> So we have a UI dashboard whose response time is quite fast, as all the data
>>> is cached. Users query data based on time range, and there is always new
>>> data coming into the system at a predefined frequency, let's say every hour.
>>>
>>> As you said, I can uncache tables, but that will basically drop all data from
>>> memory. I cannot afford to lose my cache even for a short interval, as all
>>> queries from the UI will get slow until the cache loads again. UI response
>>> time needs to be predictable and should be fast enough that the user does
>>> not get irritated.
>>>
>>> Also, I cannot keep two copies of the data in memory (until the new RDD
>>> materializes), as that would surpass the total available memory in the system.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>




Re: Broadcast error

2014-09-14 Thread Chengi Liu
And the thing is, the code runs just fine if I reduce the number of rows in my
data.

On Sun, Sep 14, 2014 at 8:45 PM, Chengi Liu  wrote:

> I am using spark1.0.2.
> This is my work cluster.. so I can't setup a new version readily...
> But right now, I am not using broadcast ..
>
>
> conf = SparkConf().set("spark.executor.memory",
> "32G").set("spark.akka.frameSize", "1000")
> sc = SparkContext(conf = conf)
> rdd = sc.parallelize(matrix,5)
>
> from pyspark.mllib.clustering import KMeans
> from math import sqrt
> clusters = KMeans.train(rdd, 5, maxIterations=2,runs=2,
> initializationMode="random")
> def error(point):
> center = clusters.centers[clusters.predict(point)]
> return sqrt(sum([x**2 for x in (point - center)]))
>
> WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
> print "Within Set Sum of Squared Error = " + str(WSSSE)
>
>
> executed by
> spark-submit --master $SPARKURL clustering_example.py  --executor-memory
> 32G  --driver-memory 60G
>
> and the error I see
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o26.trainKMeansModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: All
> masters are unresponsive! Giving up.
> at org.apache.spark.scheduler.DAGScheduler.org
> 
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> and
> 14/09/14 20:43:30 WARN AppClient$ClientActor: Could not connect to
> akka.tcp://sparkMaster@hostname:7077:
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkMaster@ hostname:7077]
>
> ??
> Any suggestions??
>
>
> On Sun, Sep 14, 2014 at 8:39 PM, Davies Liu  wrote:
>
>> Hey Chengi,
>>
>> What's the version of Spark you are using? It have big improvements
>> about broadcast in 1.1, could you try it?
>>
>> On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu 
>> wrote:
>> > Any suggestions.. I am really blocked on this one
>> >
>> > On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu 
>> wrote:
>> >>
>> >> And when I use sparksubmit script, I get the following error:
>> >>
>> >> py4j.protocol.Py4JJavaError: An error occurred while calling
>> >> o26.trainKMeansModel.
>> >> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> All
>> >> masters are unresponsive! Giving up.
>> >> at
>> >> org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>> >> at
>> >>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> >> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>> >> at scala.Option.foreach(Option.scala:236)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>> >> at
>> >>
>> org.apache.spark.scheduler.DAGSchedulerE

Re: Broadcast error

2014-09-14 Thread Chengi Liu
I am using Spark 1.0.2.
This is my work cluster, so I can't set up a new version readily...
But right now, I am not using broadcast.


from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeans
from math import sqrt

conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize", "1000")
sc = SparkContext(conf=conf)
rdd = sc.parallelize(matrix, 5)

clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2, initializationMode="random")

def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print "Within Set Sum of Squared Error = " + str(WSSSE)


executed by
spark-submit --master $SPARKURL clustering_example.py --executor-memory 32G --driver-memory 60G

and the error I see
py4j.protocol.Py4JJavaError: An error occurred while calling
o26.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org

$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


and
14/09/14 20:43:30 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@hostname:7077:
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkMaster@ hostname:7077]

??
Any suggestions??


On Sun, Sep 14, 2014 at 8:39 PM, Davies Liu  wrote:

> Hey Chengi,
>
> What's the version of Spark you are using? It have big improvements
> about broadcast in 1.1, could you try it?
>
> On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu 
> wrote:
> > Any suggestions.. I am really blocked on this one
> >
> > On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu 
> wrote:
> >>
> >> And when I use sparksubmit script, I get the following error:
> >>
> >> py4j.protocol.Py4JJavaError: An error occurred while calling
> >> o26.trainKMeansModel.
> >> : org.apache.spark.SparkException: Job aborted due to stage failure: All
> >> masters are unresponsive! Giving up.
> >> at
> >> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> >> at
> >>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
> >> at
> >>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
> >> at
> >>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >> at
> >>
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
> >> at
> >>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> >> at
> >>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> >> at scala.Option.foreach(Option.scala:236)
> >> at
> >>
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
> >> at
> >>
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
> >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> >> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> >> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> >> at
> >>
> 

Re: Broadcast error

2014-09-14 Thread Davies Liu
Hey Chengi,

What's the version of Spark you are using? There are big improvements
to broadcast in 1.1; could you try it?

On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu  wrote:
> Any suggestions.. I am really blocked on this one
>
> On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu  wrote:
>>
>> And when I use sparksubmit script, I get the following error:
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o26.trainKMeansModel.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: All
>> masters are unresponsive! Giving up.
>> at
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>>
>> My spark submit code is
>>
>> conf = SparkConf().set("spark.executor.memory",
>> "32G").set("spark.akka.frameSize", "1000")
>> sc = SparkContext(conf = conf)
>> rdd = sc.parallelize(matrix,5)
>>
>> from pyspark.mllib.clustering import KMeans
>> from math import sqrt
>> clusters = KMeans.train(rdd, 5, maxIterations=2,runs=2,
>> initializationMode="random")
>> def error(point):
>> center = clusters.centers[clusters.predict(point)]
>> return sqrt(sum([x**2 for x in (point - center)]))
>>
>> WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>> print "Within Set Sum of Squared Error = " + str(WSSSE)
>>
>> Which is executed as following:
>> spark-submit --master $SPARKURL clustering_example.py  --executor-memory
>> 32G  --driver-memory 60G
>>
>> On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu 
>> wrote:
>>>
>>> How? Example please..
>>> Also, if I am running this in pyspark shell.. how do i configure
>>> spark.akka.frameSize ??
>>>
>>>
>>> On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das 
>>> wrote:

 When the data size is huge, you better of use the
 torrentBroadcastFactory.

 Thanks
 Best Regards

 On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu 
 wrote:
>
> Specifically the error I see when I try to operate on rdd created by
> sc.parallelize method
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Serialized task 12:12 was 12062263 bytes which exceeds 
> spark.akka.frameSize
> (10485760 bytes). Consider using broadcast variables for large values.
>
> On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu 
> wrote:
>>
>> Hi,
>>I am trying to create an rdd out of large matrix sc.parallelize
>> suggest to use broadcast
>> But when I do
>>
>> sc.broadcast(data)
>> I get this error:
>>
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line
>> 370, in broadcast
>> pickled = pickleSer.dumps(value)
>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py",
>> line 279, in dumps
>> def dumps(self, obj): return cPickle.dumps(obj, 2)
>> SystemError: error return without exception set
>> Help?
>>
>

>>>
>>
>




Re: Broadcast error

2014-09-14 Thread Chengi Liu
Any suggestions.. I am really blocked on this one

On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu  wrote:

> And when I use sparksubmit script, I get the following error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o26.trainKMeansModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: All
> masters are unresponsive! Giving up.
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> My spark submit code is
>
> conf = SparkConf().set("spark.executor.memory",
> "32G").set("spark.akka.frameSize", "1000")
> sc = SparkContext(conf = conf)
> rdd = sc.parallelize(matrix,5)
>
> from pyspark.mllib.clustering import KMeans
> from math import sqrt
> clusters = KMeans.train(rdd, 5, maxIterations=2,runs=2,
> initializationMode="random")
> def error(point):
> center = clusters.centers[clusters.predict(point)]
> return sqrt(sum([x**2 for x in (point - center)]))
>
> WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
> print "Within Set Sum of Squared Error = " + str(WSSSE)
>
> Which is executed as following:
> spark-submit --master $SPARKURL clustering_example.py  --executor-memory
> 32G  --driver-memory 60G
>
> On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu 
> wrote:
>
>> How? Example please..
>> Also, if I am running this in pyspark shell.. how do i configure
>> spark.akka.frameSize ??
>>
>>
>> On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das 
>> wrote:
>>
>>> When the data size is huge, you better of use the
>>> torrentBroadcastFactory.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu 
>>> wrote:
>>>
 Specifically the error I see when I try to operate on rdd created by
 sc.parallelize method
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
 (10485760 bytes). Consider using broadcast variables for large values.

 On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu 
 wrote:

> Hi,
>I am trying to create an rdd out of large matrix sc.parallelize
> suggest to use broadcast
> But when I do
>
> sc.broadcast(data)
> I get this error:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line
> 370, in broadcast
> pickled = pickleSer.dumps(value)
>   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py",
> line 279, in dumps
> def dumps(self, obj): return cPickle.dumps(obj, 2)
> SystemError: error return without exception set
> Help?
>
>

>>>
>>
>


About SparkSQL 1.1.0 join between more than two table

2014-09-14 Thread boyingk...@163.com

Hi:
When I use Spark SQL (1.0.1), I found it does not support joins between three
tables, e.g.:
sql("SELECT * FROM youhao_data left join youhao_age on 
(youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on 
(youhao_age.rowkey=youhao_totalKiloMeter.rowkey)") 
I get this exception:
Exception in thread "main" java.lang.RuntimeException: [1.90] failure: 
``UNION'' expected but `left' found

Does Spark SQL 1.1.0 support joins between three tables?
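
One workaround that may be worth a try in the meantime (a sketch only, assuming the three tables are visible to a HiveContext; this is not a statement about what the 1.1.0 SQL parser itself supports): HiveContext's HiveQL parser accepts multi-way joins.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// hql was the HiveQL entry point in the 1.0.x line.
hiveContext.hql(
  "SELECT * FROM youhao_data " +
  "LEFT JOIN youhao_age ON (youhao_data.rowkey = youhao_age.rowkey) " +
  "LEFT JOIN youhao_totalKiloMeter ON (youhao_age.rowkey = youhao_totalKiloMeter.rowkey)")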




boyingk...@163.com

PathFilter for newAPIHadoopFile?

2014-09-14 Thread Eric Friedman
Hi,

I have a directory structure with parquet+avro data in it. There are a
couple of administrative files (.foo and/or _foo) that I need to ignore
when processing this data or Spark tries to read them as containing parquet
content, which they do not.

How can I set a PathFilter on the FileInputFormat used to construct an RDD?


Re: Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Zhanfeng Huo
Thank you very much. 

It is helpful for end users.



Zhanfeng Huo
 
From: Patrick Wendell
Date: 2014-09-15 10:19
To: Zhanfeng Huo
CC: user
Subject: Re: spark-1.1.0 with make-distribution.sh problem
Yeah that issue has been fixed by adding better docs, it just didn't make it in 
time for the release:

https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54


On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo  wrote:
resolved:

./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon -Pyarn 
-Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests

This code is a bit misleading




Zhanfeng Huo
 
From: Zhanfeng Huo
Date: 2014-09-12 14:13
To: user
Subject: spark-1.1.0 with make-distribution.sh problem
Hi,

I compile Spark with the command "bash -x make-distribution.sh -Pyarn -Phive
--skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
-Phadoop.version=2.3.0", and it errors.

How do I use it correctly?

   message:
+ set -o pipefail 
+ set -e 
+++ dirname make-distribution.sh 
++ cd . 
++ pwd 
+ FWDIR=/home/syn/spark/spark-1.1.0 
+ DISTDIR=/home/syn/spark/spark-1.1.0/dist 
+ SPARK_TACHYON=false 
+ MAKE_TGZ=false 
+ NAME=none 
+ (( 7 )) 
+ case $1 in 
+ break 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ which git 
++ git rev-parse --short HEAD 
+ GITREV=5f6f219 
+ '[' '!' -z 5f6f219 ']' 
+ GITREVSTRING=' (git revision 5f6f219)' 
+ unset GITREV 
+ which mvn 
++ mvn help:evaluate -Dexpression=project.version 
++ grep -v INFO 
++ tail -n 1 
+ VERSION=1.1.0 
++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive --skip-java-test 
--with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0 
++ grep -v INFO 
++ tail -n 1 
+ SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'

Best Regards


Zhanfeng Huo



Alternative to spark.executor.extraClassPath ?

2014-09-14 Thread innowireless TaeYun Kim
Hi,

 

In the Spark configuration documentation, spark.executor.extraClassPath is described
as a backwards-compatibility option. It also says that users typically
should not need to set this option.

 

Now, I must add a classpath to the executor environment (as well as to the
driver in the future, but for now I'm running YARN-client mode).

Its value is '/usr/lib/hbase/lib/*'. (I'm trying to use HBase classes.)

How can I add that to the executor environment without using
spark.executor.extraClassPath?
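
One alternative that may work (a sketch only, assuming the HBase jars are readable from the node where the driver runs; the app name is illustrative): add the jars programmatically with SparkContext.addJar, which ships them to the executors and puts them on the task classloader.

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-app"))

// Ship every HBase jar to the executors.
new File("/usr/lib/hbase/lib").listFiles()
  .filter(_.getName.endsWith(".jar"))
  .foreach(jar => sc.addJar(jar.getAbsolutePath))

The same comma-separated list of jars can also be passed to spark-submit through its --jars option.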

 

BTW, spark.executor.extraClassPath 'prepends' the classpath to the CLASSPATH
environment variable instead of appending it, and this seems to cause a few
problems for my application (I've investigated launch_container.sh). Is there
a way to make it 'append' rather than 'prepend'?

 

I use Spark version 1.0.0.

 

Thanks.

 

 



Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah, that issue has been fixed by adding better docs; it just didn't make
it in time for the release:

https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54


On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo 
wrote:

> resolved:
>
> ./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon
> -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
>
> This code is a bit misleading
>
>
> --
> Zhanfeng Huo
>
>
> *From:* Zhanfeng Huo 
> *Date:* 2014-09-12 14:13
> *To:* user 
> *Subject:* spark-1.1.0 with make-distribution.sh problem
> Hi,
>
> I compile spark with cmd  "bash -x make-distribution.sh -Pyarn -Phive
> --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
> -Phadoop.version=2.3.0", it errors.
>
> How to use it correct?
>
>message:
> + set -o pipefail
> + set -e
> +++ dirname make-distribution.sh
> ++ cd .
> ++ pwd
> + FWDIR=/home/syn/spark/spark-1.1.0
> + DISTDIR=/home/syn/spark/spark-1.1.0/dist
> + SPARK_TACHYON=false
> + MAKE_TGZ=false
> + NAME=none
> + (( 7 ))
> + case $1 in
> + break
> + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
> + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
> + which git
> ++ git rev-parse --short HEAD
> + GITREV=5f6f219
> + '[' '!' -z 5f6f219 ']'
> + GITREVSTRING=' (git revision 5f6f219)'
> + unset GITREV
> + which mvn
> ++ mvn help:evaluate -Dexpression=project.version
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=1.1.0
> ++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive
> --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
> -Phadoop.version=2.3.0
> ++ grep -v INFO
> ++ tail -n 1
> + SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'
>
> Best Regards
> --
> Zhanfeng Huo
>
>


Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew,

I agree with Nicholas. That was a nice, concise summary of the
meaning of the locality customization options, indicators, and default
Spark behaviors. I haven't combed through the documentation
end-to-end in a while, but I'm not sure that information is
presently represented anywhere, and it would be great to persist it
somewhere besides the mailing list.

best,
-Brad

On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas
 wrote:
> Andrew,
>
> This email was pretty helpful. I feel like this stuff should be summarized
> in the docs somewhere, or perhaps in a blog post.
>
> Do you know if it is?
>
> Nick
>
>
> On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash  wrote:
>>
>> The locality is how close the data is to the code that's processing it.
>> PROCESS_LOCAL means data is in the same JVM as the code that's running, so
>> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
>> same node, or in another executor on the same node, so is a little slower
>> because the data has to travel across an IPC connection.  RACK_LOCAL is even
>> slower -- data is on a different server so needs to be sent over the
>> network.
>>
>> Spark switches to lower locality levels when there's no unprocessed data
>> on a node that has idle CPUs.  In that situation you have two options: wait
>> until the busy CPUs free up so you can start another task that uses data on
>> that server, or start a new task on a farther away server that needs to
>> bring data from that remote place.  What Spark typically does is wait a bit
>> in the hopes that a busy CPU frees up.  Once that timeout expires, it starts
>> moving the data from far away to the free CPU.
>>
>> The main tunable option is how long the scheduler waits before
>> starting to move data rather than code. Those are the spark.locality.*
>> settings here: http://spark.apache.org/docs/latest/configuration.html
>>
>> If you want to prevent this from happening entirely, you can set the
>> values to ridiculously high numbers.  The documentation also mentions that
>> "0" has special meaning, so you can try that as well.
>>
>> Good luck!
>> Andrew
>>
>>
>> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung 
>> wrote:
>>>
>>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
>>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
>>>
>>> When these happen things get extremely slow.
>>>
>>> Does this mean that the executor got terminated and restarted?
>>>
>>> Is there a way to prevent this from happening (barring the machine
>>> actually going down, I'd rather stick with the same process)?
>>
>>
>
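
As a concrete illustration of the spark.locality.* settings mentioned above, a sketch (values are illustrative and, in the Spark 1.x releases of that time, interpreted as milliseconds):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "10000")       // base wait before the scheduler drops a locality level
  .set("spark.locality.wait.node", "30000")  // per-level override; .process and .rack also exist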




Re: Broadcast error

2014-09-14 Thread Chengi Liu
And when I use sparksubmit script, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling
o26.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


My spark submit code is

from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeans
from math import sqrt

conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize", "1000")
sc = SparkContext(conf=conf)
rdd = sc.parallelize(matrix, 5)

clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2, initializationMode="random")

def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print "Within Set Sum of Squared Error = " + str(WSSSE)

Which is executed as follows:
spark-submit --master $SPARKURL clustering_example.py --executor-memory 32G --driver-memory 60G

On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu 
wrote:

> How? Example please..
> Also, if I am running this in pyspark shell.. how do i configure
> spark.akka.frameSize ??
>
>
> On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das 
> wrote:
>
>> When the data size is huge, you're better off using the TorrentBroadcastFactory.
>>
>> Thanks
>> Best Regards
>>
>> On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu 
>> wrote:
>>
>>> Specifically the error I see when I try to operate on rdd created by
>>> sc.parallelize method
>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
>>> (10485760 bytes). Consider using broadcast variables for large values.
>>>
>>> On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu 
>>> wrote:
>>>
 Hi,
I am trying to create an rdd out of large matrix sc.parallelize
 suggest to use broadcast
 But when I do

 sc.broadcast(data)
 I get this error:

 Traceback (most recent call last):
   File "", line 1, in 
   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line
 370, in broadcast
 pickled = pickleSer.dumps(value)
   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py",
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 SystemError: error return without exception set
 Help?


>>>
>>
>


Re: compiling spark source code

2014-09-14 Thread Matei Zaharia
I've seen the "file name too long" error when compiling on an encrypted Linux 
file system -- some of them have a limit on file name lengths. If you're on 
Linux, can you try compiling inside /tmp instead?

Matei

On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote:

Can you try "sbt/sbt clean" first?

On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu  wrote:
bq. [error] File name too long

It is not clear which file(s) loadfiles was loading.
Is the filename in an earlier part of the output?

Cheers

On Sat, Sep 13, 2014 at 10:58 AM, kkptninja  wrote:
Hi Ted,

Thanks for the prompt reply :)

Please find details of the issue at this URL: http://pastebin.com/Xt0hZ38q


Kind Regards




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: spark 1.1.0 unit tests fail

2014-09-14 Thread Koert Kuipers
OK, sounds good. Those were the only tests that failed, by the way.


On Sun, Sep 14, 2014 at 1:07 AM, Andrew Or  wrote:

> Hi Koert,
>
> Thanks for reporting this. These tests have been flaky even on the master
> branch for a long time. You can safely disregard these test failures, as
> the root cause is port collisions from the many SparkContexts we create
> over the course of the entire test run. There is a patch that fixes this, but
> it has not been backported into branch-1.1 yet. I will do that shortly.
>
> -Andrew
>
> 2014-09-13 17:27 GMT-07:00 Koert Kuipers :
>
>> On Ubuntu 12.04 with 2 cores and 8G of RAM I see errors when I run the
>> tests for Spark 1.1.0. I'm not sure how significant this is, since I used to
>> see errors for Spark 1.0.0 too.
>>
>> $ java -version
>> java version "1.6.0_43"
>> Java(TM) SE Runtime Environment (build 1.6.0_43-b01)
>> Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)
>>
>> $ mvn -version
>> Apache Maven 3.0.4
>> Maven home: /usr/share/maven
>> Java version: 1.6.0_43, vendor: Sun Microsystems Inc.
>> Java home: /usr/lib/jvm/jdk1.6.0_43/jre
>> Default locale: en_US, platform encoding: UTF-8
>> OS name: "linux", version: "3.5.0-54-generic", arch: "amd64", family:
>> "unix"
>>
>> $ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
>> -XX:ReservedCodeCacheSize=512m"
>> $ mvn clean package -DskipTests
>> $ mvn test
>>
>> It is still running and is very slow (and, curiously, with very low CPU
>> usage, like 5%), but I already see the following errors:
>>
>> DriverSuite:
>> - driver should exit after finishing *** FAILED ***
>>   TestFailedDueToTimeoutException was thrown during property evaluation.
>> (DriverSuite.scala:40)
>> Message: The code passed to failAfter did not complete within 60
>> seconds.
>> Location: (DriverSuite.scala:41)
>> Occurred at table row 0 (zero based, not counting headings), which
>> had values (
>>   master = local
>> )
>>
>> SparkSubmitSuite:
>> - launch simple application with spark-submit *** FAILED ***
>>   org.apache.spark.SparkException: Process List(./bin/spark-submit,
>> --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp,
>> --master, local, file:/tmp/1410653580697-0/testJar-1410653580697.jar)
>> exited with code 1
>>   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
>>   at
>> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>>   at
>> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>   ...
>> - spark submit includes jars passed in through --jar *** FAILED ***
>>   org.apache.spark.SparkException: Process List(./bin/spark-submit,
>> --class, org.apache.spark.deploy.JarCreationTest, --name, testApp,
>> --master, local-cluster[2,1,512], --jars,
>> file:/tmp/1410653674739-0/testJar-1410653674790.jar,file:/tmp/1410653674791-0/testJar-1410653674833.jar,
>> file:/tmp/1410653674737-0/testJar-1410653674737.jar) exited with code 1
>>   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
>>   at
>> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
>>   at
>> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>>   at
>> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>   ...
>>
>>
>


Re: Broadcast error

2014-09-14 Thread Chengi Liu
How? Example please..
Also, if I am running this in pyspark shell.. how do i configure
spark.akka.frameSize ??


On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das 
wrote:

> When the data size is huge, you're better off using the TorrentBroadcastFactory.
>
> Thanks
> Best Regards
>
> On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu 
> wrote:
>
>> Specifically the error I see when I try to operate on rdd created by
>> sc.parallelize method
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
>> (10485760 bytes). Consider using broadcast variables for large values.
>>
>> On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu 
>> wrote:
>>
>>> Hi,
>>>I am trying to create an rdd out of large matrix sc.parallelize
>>> suggest to use broadcast
>>> But when I do
>>>
>>> sc.broadcast(data)
>>> I get this error:
>>>
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line
>>> 370, in broadcast
>>> pickled = pickleSer.dumps(value)
>>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line
>>> 279, in dumps
>>> def dumps(self, obj): return cPickle.dumps(obj, 2)
>>> SystemError: error return without exception set
>>> Help?
>>>
>>>
>>
>


Re: HBase 0.96+ with Spark 1.0+

2014-09-14 Thread Reinis Vicups
I did actually try Sean's suggestion just before I posted for the first
time in this thread. I got an error when doing this and thought that I
was not understanding what Sean was suggesting.


Now I re-attempted your suggestions with spark 1.0.0-cdh5.1.0, hbase
0.98.1-cdh5.1.0 and hadoop 2.3.0-cdh5.1.0, which I am currently using.


I used the following:

  val mortbayEnforce = "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
  val mortbayExclusion = ExclusionRule(organization = "org.mortbay.jetty", name = "servlet-api-2.5")


and applied this to the hadoop and hbase dependencies, e.g. like this:

val hbase = Seq(HBase.server, HBase.common, HBase.compat, HBase.compat2,
  HBase.protocol, mortbayEnforce).map(_.excludeAll(HBase.exclusions: _*))


private object HBase {
  val server = "org.apache.hbase" % "hbase-server" % Version.HBase
  ...
  val exclusions = Seq(ExclusionRule("org.apache.ant"), mortbayExclusion)
}

I still get the error I got the last time I tried this experiment:

14/09/14 18:28:09 ERROR metrics.MetricsSystem: Sink class 
org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
at 
org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:84)
at 
org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
at org.apache.spark.SparkContext.(SparkContext.scala:202)
at 
d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply$mcV$sp(SimpleTicketTextSimilaritySparkJobSpec.scala:29)
at 
d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
at 
d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)

at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)

at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)

at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)

at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1714)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1683)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FlatSpec.org$scalatest$FlatSpecLike$$super$run(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1760)
at 
org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1760)

at org.scalatest.SuperEngine.runImpl(Engine.scala

failed to run SimpleApp locally on macbook

2014-09-14 Thread Gary Zhao
Hello

I'm new to Spark and I couldn't make the SimpleApp run on my macbook. I
feel it's related to network configuration. Could anyone take a look?
Thanks.

14/09/14 10:10:36 INFO Utils: Fetching
http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to
/var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4ftc9n6/T/fetchFileTemp1014702023347580837.tmp
14/09/14 10:11:36 INFO Executor: Fetching
http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar with timestamp
1410714636103
14/09/14 10:11:36 INFO Utils: Fetching
http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to
/var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4ftc9n6/T/fetchFileTemp4432132500879081005.tmp
14/09/14 10:11:36 ERROR Executor: Exception in task ID 1
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:348)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org
$apache$spark$executor$Executor$$updateDependencies(Executor.scala:328)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/09/14 10:11:36 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/09/14 10:11:36 WARN TaskSetManager: Loss was due to
java.net.SocketTimeoutException
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:348)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at sca

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Sorry, I meant any *other* SBT files.

However, what happens if you remove the line:

exclude("org.eclipse.jetty.orbit", "javax.servlet")


dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Sun, Sep 14, 2014 at 11:53 AM, Dean Wampler 
wrote:

> Can you post your whole SBT build file(s)?
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler 
> wrote:
>
>> Hi,
>>
>> I just called:
>>
>> > test
>>
>> or
>>
>> > run
>>
>> Thorsten
>>
>>
>> On 10.09.2014 at 13:38, arthur.hk.c...@gmail.com wrote:
>>
>>  Hi,
>>>
>>> What is your SBT command and the parameters?
>>>
>>> Arthur
>>>
>>>
>>> On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler  wrote:
>>>
>>>  Hello,

 I am writing a Spark App which is already working so far.
 Now I have started to build some unit tests as well, but I am running into some
 dependency problems and cannot find a solution right now. Perhaps someone
 could help me.

 I build my Spark Project with SBT and it seems to be configured well,
 because compiling, assembling and running the built jar with spark-submit
 are working well.

 Now I started with the UnitTests, which I located under /src/test/scala.

 When I call "test" in sbt, I get the following:

 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered
 BlockManager
 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
 [trace] Stack trace suppressed: run last test:test for the full output.
 [error] Could not run test test.scala.SetSuite: 
 java.lang.NoClassDefFoundError:
 javax/servlet/http/HttpServletResponse
 [info] Run completed in 626 milliseconds.
 [info] Total number of tests run: 0
 [info] Suites: completed 0, aborted 0
 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
 [info] All tests passed.
 [error] Error during tests:
 [error] test.scala.SetSuite
 [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
 [error] Total time: 3 s, completed 10.09.2014 12:22:06

 last test:test gives me the following:

  last test:test
>
 [debug] Running TaskDef(test.scala.SetSuite,
 org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at org.apache.spark.broadcast.HttpBroadcast$.createServer(
 HttpBroadcast.scala:156)
 at org.apache.spark.broadcast.HttpBroadcast$.initialize(
 HttpBroadcast.scala:127)
 at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
 HttpBroadcastFactory.scala:31)
 at org.apache.spark.broadcast.BroadcastManager.initialize(
 BroadcastManager.scala:48)
 at org.apache.spark.broadcast.BroadcastManager.(
 BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.SparkContext.(SparkContext.scala:202)
 at test.scala.SetSuite.(SparkTest.scala:16)

 I also noticed right now, that sbt run is also not working:

 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
 [error] (run-main-2) java.lang.NoClassDefFoundError:
 javax/servlet/http/HttpServletResponse
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at org.apache.spark.broadcast.HttpBroadcast$.createServer(
 HttpBroadcast.scala:156)
 at org.apache.spark.broadcast.HttpBroadcast$.initialize(
 HttpBroadcast.scala:127)
 at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
 HttpBroadcastFactory.scala:31)
 at org.apache.spark.broadcast.BroadcastManager.initialize(
 BroadcastManager.scala:48)
 at org.apache.spark.broadcast.BroadcastManager.(
 BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.SparkContext.(SparkContext.scala:202)
 at main.scala.PartialDuplicateScanner$.main(
 PartialDuplicateScanner.scala:29)
 at main.scala.PartialDuplicateScanner.main(
 PartialDuplicateScanner.scala)

 Here is my Testprojekt.sbt file:

 name := "Testprojekt"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= {
   Seq(
 "org.apache.lucene" % "lucene-core" % "4.9.0",
 "org.apache.lucene" % "lucene-analyzers-common" % "4.9.

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Can you post your whole SBT build file(s)?

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler  wrote:

> Hi,
>
> I just called:
>
> > test
>
> or
>
> > run
>
> Thorsten
>
>
> On 10.09.2014 at 13:38, arthur.hk.c...@gmail.com wrote:
>
>  Hi,
>>
>> What is your SBT command and the parameters?
>>
>> Arthur
>>
>>
>> On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler  wrote:
>>
>>  Hello,
>>>
>>> I am writing a Spark App which is already working so far.
>>> Now I have started to build some unit tests as well, but I am running into some
>>> dependency problems and cannot find a solution right now. Perhaps someone
>>> could help me.
>>>
>>> I build my Spark Project with SBT and it seems to be configured well,
>>> because compiling, assembling and running the built jar with spark-submit
>>> are working well.
>>>
>>> Now I started with the UnitTests, which I located under /src/test/scala.
>>>
>>> When I call "test" in sbt, I get the following:
>>>
>>> 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered
>>> BlockManager
>>> 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
>>> [trace] Stack trace suppressed: run last test:test for the full output.
>>> [error] Could not run test test.scala.SetSuite: 
>>> java.lang.NoClassDefFoundError:
>>> javax/servlet/http/HttpServletResponse
>>> [info] Run completed in 626 milliseconds.
>>> [info] Total number of tests run: 0
>>> [info] Suites: completed 0, aborted 0
>>> [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
>>> [info] All tests passed.
>>> [error] Error during tests:
>>> [error] test.scala.SetSuite
>>> [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
>>> [error] Total time: 3 s, completed 10.09.2014 12:22:06
>>>
>>> last test:test gives me the following:
>>>
>>>  last test:test

>>> [debug] Running TaskDef(test.scala.SetSuite,
>>> org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer(
>>> HttpBroadcast.scala:156)
>>> at org.apache.spark.broadcast.HttpBroadcast$.initialize(
>>> HttpBroadcast.scala:127)
>>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
>>> HttpBroadcastFactory.scala:31)
>>> at org.apache.spark.broadcast.BroadcastManager.initialize(
>>> BroadcastManager.scala:48)
>>> at org.apache.spark.broadcast.BroadcastManager.(
>>> BroadcastManager.scala:35)
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>> at test.scala.SetSuite.(SparkTest.scala:16)
>>>
>>> I also noticed right now, that sbt run is also not working:
>>>
>>> 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
>>> [error] (run-main-2) java.lang.NoClassDefFoundError: javax/servlet/http/
>>> HttpServletResponse
>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer(
>>> HttpBroadcast.scala:156)
>>> at org.apache.spark.broadcast.HttpBroadcast$.initialize(
>>> HttpBroadcast.scala:127)
>>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
>>> HttpBroadcastFactory.scala:31)
>>> at org.apache.spark.broadcast.BroadcastManager.initialize(
>>> BroadcastManager.scala:48)
>>> at org.apache.spark.broadcast.BroadcastManager.(
>>> BroadcastManager.scala:35)
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>> at main.scala.PartialDuplicateScanner$.main(
>>> PartialDuplicateScanner.scala:29)
>>> at main.scala.PartialDuplicateScanner.main(
>>> PartialDuplicateScanner.scala)
>>>
>>> Here is my Testprojekt.sbt file:
>>>
>>> name := "Testprojekt"
>>>
>>> version := "1.0"
>>>
>>> scalaVersion := "2.10.4"
>>>
>>> libraryDependencies ++= {
>>>   Seq(
>>> "org.apache.lucene" % "lucene-core" % "4.9.0",
>>> "org.apache.lucene" % "lucene-analyzers-common" % "4.9.0",
>>> "org.apache.lucene" % "lucene-queryparser" % "4.9.0",
>>> ("org.apache.spark" %% "spark-core" % "1.0.2").
>>> exclude("org.mortbay.jetty", "servlet-api").
>>> exclude("commons-beanutils", "commons-beanutils-core").
>>> exclude("commons-collections", "commons-collections").
>>> exclude("commons-collections", "commons-collections").
>>> exclude("com.esotericsoftware.minlog", "minlog").
>>> exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish").
>>> exclude("org.eclipse.jetty.orbit", "javax.transa

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
Take a look at bin/run-example

Cheers
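
For context, a rough sketch of what the body of HBaseTest.scala looks like once it is adapted for spark-shell: the package declaration and the SparkConf/SparkContext construction have to be dropped, because the REPL rejects package definitions and already provides sc. The table name is a placeholder and the HBase client jars are assumed to be on the shell's classpath; a compiled copy of the example would normally be launched through bin/run-example instead.

// Sketch only: shell-friendly variant of examples/HBaseTest.scala.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
println(hBaseRDD.count())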

On Sun, Sep 14, 2014 at 9:15 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> I applied the patch.
>
> 1) patched
>
> $ patch -p0 -i spark-1297-v5.txt
> patching file docs/building-with-maven.md
> patching file examples/pom.xml
>
>
> 2) Compilation result
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. SUCCESS [1.550s]
> [INFO] Spark Project Core  SUCCESS
> [1:32.175s]
> [INFO] Spark Project Bagel ... SUCCESS
> [10.809s]
> [INFO] Spark Project GraphX .. SUCCESS
> [31.435s]
> [INFO] Spark Project Streaming ... SUCCESS
> [44.518s]
> [INFO] Spark Project ML Library .. SUCCESS
> [48.992s]
> [INFO] Spark Project Tools ... SUCCESS [7.028s]
> [INFO] Spark Project Catalyst  SUCCESS
> [40.365s]
> [INFO] Spark Project SQL . SUCCESS
> [43.305s]
> [INFO] Spark Project Hive  SUCCESS
> [36.464s]
> [INFO] Spark Project REPL  SUCCESS
> [20.319s]
> [INFO] Spark Project YARN Parent POM . SUCCESS [1.032s]
> [INFO] Spark Project YARN Stable API . SUCCESS
> [19.379s]
> [INFO] Spark Project Hive Thrift Server .. SUCCESS
> [12.470s]
> [INFO] Spark Project Assembly  SUCCESS
> [13.822s]
> [INFO] Spark Project External Twitter  SUCCESS [9.566s]
> [INFO] Spark Project External Kafka .. SUCCESS
> [12.848s]
> [INFO] Spark Project External Flume Sink . SUCCESS
> [10.437s]
> [INFO] Spark Project External Flume .. SUCCESS
> [14.554s]
> [INFO] Spark Project External ZeroMQ . SUCCESS [9.994s]
> [INFO] Spark Project External MQTT ... SUCCESS [8.684s]
> [INFO] Spark Project Examples  SUCCESS
> [1:31.610s]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 9:41.700s
> [INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
> [INFO] Final Memory: 83M/1071M
> [INFO]
> 
>
>
>
> 3) testing:
> scala> package org.apache.spark.examples
> :1: error: illegal start of definition
>package org.apache.spark.examples
>^
>
>
> scala> import org.apache.hadoop.hbase.client.HBaseAdmin
> import org.apache.hadoop.hbase.client.HBaseAdmin
>
> scala> import org.apache.hadoop.hbase.{HBaseConfiguration,
> HTableDescriptor}
> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>
> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>
> scala> import org.apache.spark._
> import org.apache.spark._
>
> scala> object HBaseTest {
>  | def main(args: Array[String]) {
>  | val sparkConf = new SparkConf().setAppName("HBaseTest")
>  | val sc = new SparkContext(sparkConf)
>  | val conf = HBaseConfiguration.create()
>  | // Other options for configuring scan behavior are available. More
> information available at
>  | //
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
>  | conf.set(TableInputFormat.INPUT_TABLE, args(0))
>  | // Initialize hBase table if necessary
>  | val admin = new HBaseAdmin(conf)
>  | if (!admin.isTableAvailable(args(0))) {
>  | val tableDesc = new HTableDescriptor(args(0))
>  | admin.createTable(tableDesc)
>  | }
>  | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>  | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>  | classOf[org.apache.hadoop.hbase.client.Result])
>  | hBaseRDD.count()
>  | sc.stop()
>  | }
>  | }
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
> defined module HBaseTest
>
>
>
> Now I only get an error when trying to run "package org.apache.spark.examples".
>
> Please advise.
> Regards
> Arthur
>
>
>
> On 14 Sep, 2014, at 11:41 pm, Ted Yu  wrote:
>
> I applied the patch on master branch without rejects.
>
> If you use spark 1.0.2, use pom.xml attached to the JIRA.
>
> On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com <
> arthur.hk.c...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks!
>>
>> patch -p0 -i spark-1297-v5.txt
>> patching file docs/building-with-maven.md
>> patching file examples/pom.xml
>> Hunk #1 FAILED at 45.
>> Hunk #2 FAILED at 110.
>> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
>>
>> 

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

I applied the patch.

1) patched

$ patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


2) Compilation result
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.550s]
[INFO] Spark Project Core  SUCCESS [1:32.175s]
[INFO] Spark Project Bagel ... SUCCESS [10.809s]
[INFO] Spark Project GraphX .. SUCCESS [31.435s]
[INFO] Spark Project Streaming ... SUCCESS [44.518s]
[INFO] Spark Project ML Library .. SUCCESS [48.992s]
[INFO] Spark Project Tools ... SUCCESS [7.028s]
[INFO] Spark Project Catalyst  SUCCESS [40.365s]
[INFO] Spark Project SQL . SUCCESS [43.305s]
[INFO] Spark Project Hive  SUCCESS [36.464s]
[INFO] Spark Project REPL  SUCCESS [20.319s]
[INFO] Spark Project YARN Parent POM . SUCCESS [1.032s]
[INFO] Spark Project YARN Stable API . SUCCESS [19.379s]
[INFO] Spark Project Hive Thrift Server .. SUCCESS [12.470s]
[INFO] Spark Project Assembly  SUCCESS [13.822s]
[INFO] Spark Project External Twitter  SUCCESS [9.566s]
[INFO] Spark Project External Kafka .. SUCCESS [12.848s]
[INFO] Spark Project External Flume Sink . SUCCESS [10.437s]
[INFO] Spark Project External Flume .. SUCCESS [14.554s]
[INFO] Spark Project External ZeroMQ . SUCCESS [9.994s]
[INFO] Spark Project External MQTT ... SUCCESS [8.684s]
[INFO] Spark Project Examples  SUCCESS [1:31.610s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 9:41.700s
[INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
[INFO] Final Memory: 83M/1071M
[INFO] 



3) testing:  
scala> package org.apache.spark.examples
:1: error: illegal start of definition
   package org.apache.spark.examples
   ^


scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> import org.apache.spark._
import org.apache.spark._

scala> object HBaseTest {
 | def main(args: Array[String]) {
 | val sparkConf = new SparkConf().setAppName("HBaseTest")
 | val sc = new SparkContext(sparkConf)
 | val conf = HBaseConfiguration.create()
 | // Other options for configuring scan behavior are available. More 
information available at
 | // 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
 | conf.set(TableInputFormat.INPUT_TABLE, args(0))
 | // Initialize hBase table if necessary
 | val admin = new HBaseAdmin(conf)
 | if (!admin.isTableAvailable(args(0))) {
 | val tableDesc = new HTableDescriptor(args(0))
 | admin.createTable(tableDesc)
 | }
 | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
 | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
 | classOf[org.apache.hadoop.hbase.client.Result])
 | hBaseRDD.count()
 | sc.stop()
 | }
 | }
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
defined module HBaseTest



Now I only get an error when trying to run "package org.apache.spark.examples".

Please advise.
Regards
Arthur



On 14 Sep, 2014, at 11:41 pm, Ted Yu  wrote:

> I applied the patch on master branch without rejects.
> 
> If you use spark 1.0.2, use pom.xml attached to the JIRA.
> 
> On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> Thanks!
> 
> patch -p0 -i spark-1297-v5.txt
> patching file docs/building-with-maven.md
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 FAILED at 110.
> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
> 
> Still got errors.
> 
> Regards
> Arthur
> 
> On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:
> 
>> spark-1297-v5.txt is level 0 patch
>> 
>> Please use spark-1297-v5.txt
>> 
>> Cheers
>> 
>> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi,
>> 
>> Thanks!!
>> 
>> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
>> are good here,  but not sp

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
I applied the patch on master branch without rejects.

If you use spark 1.0.2, use pom.xml attached to the JIRA.

On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> Thanks!
>
> patch -p0 -i spark-1297-v5.txt
> patching file docs/building-with-maven.md
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 FAILED at 110.
> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
>
> Still got errors.
>
> Regards
> Arthur
>
> On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:
>
> spark-1297-v5.txt is level 0 patch
>
> Please use spark-1297-v5.txt
>
> Cheers
>
> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com <
> arthur.hk.c...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks!!
>>
>> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt
>> are good here,  but not spark-1297-v5.txt:
>>
>>
>> $ patch -p1 -i spark-1297-v4.txt
>> patching file examples/pom.xml
>>
>> $ patch -p1 -i spark-1297-v5.txt
>> can't find file to patch at input line 5
>> Perhaps you used the wrong -p or --strip option?
>> The text leading up to this was:
>> --
>> |diff --git docs/building-with-maven.md docs/building-with-maven.md
>> |index 672d0ef..f8bcd2b 100644
>> |--- docs/building-with-maven.md
>> |+++ docs/building-with-maven.md
>> --
>> File to patch:
>>
>>
>>
>>
>>
>>
>> Please advise.
>> Regards
>> Arthur
>>
>>
>>
>> On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:
>>
>> Spark examples builds against hbase 0.94 by default.
>>
>> If you want to run against 0.98, see:
>> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
>>
>> Cheers
>>
>> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com <
>> arthur.hk.c...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have tried to run HBaseTest.scala, but I got the following errors; any
>>> ideas on how to fix them?
>>>
>>> Q1)
>>> scala> package org.apache.spark.examples
>>> :1: error: illegal start of definition
>>>package org.apache.spark.examples
>>>
>>>
>>> Q2)
>>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>> :31: error: object hbase is not a member of package
>>> org.apache.hadoop
>>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>
>>>
>>>
>>> Regards
>>> Arthur
>>>
>>
>>
>>
>>
>
>


Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

My bad.  Tried again, worked.


patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


Thanks!
Arthur

On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> Thanks!
> 
> patch -p0 -i spark-1297-v5.txt
> patching file docs/building-with-maven.md
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 FAILED at 110.
> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
> 
> Still got errors.
> 
> Regards
> Arthur
> 
> On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:
> 
>> spark-1297-v5.txt is level 0 patch
>> 
>> Please use spark-1297-v5.txt
>> 
>> Cheers
>> 
>> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi,
>> 
>> Thanks!!
>> 
>> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
>> are good here,  but not spark-1297-v5.txt:
>> 
>> 
>> $ patch -p1 -i spark-1297-v4.txt
>> patching file examples/pom.xml
>> 
>> $ patch -p1 -i spark-1297-v5.txt
>> can't find file to patch at input line 5
>> Perhaps you used the wrong -p or --strip option?
>> The text leading up to this was:
>> --
>> |diff --git docs/building-with-maven.md docs/building-with-maven.md
>> |index 672d0ef..f8bcd2b 100644
>> |--- docs/building-with-maven.md
>> |+++ docs/building-with-maven.md
>> --
>> File to patch: 
>> 
>> 
>> 
>> 
>> 
>> 
>> Please advise.
>> Regards
>> Arthur
>> 
>> 
>> 
>> On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:
>> 
>>> Spark examples builds against hbase 0.94 by default.
>>> 
>>> If you want to run against 0.98, see:
>>> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
>>> 
>>> Cheers
>>> 
>>> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
>>>  wrote:
>>> Hi, 
>>> 
>>> I have tried to run HBaseTest.scala, but I got the following errors; any
>>> ideas on how to fix them?
>>> 
>>> Q1) 
>>> scala> package org.apache.spark.examples
>>> :1: error: illegal start of definition
>>>package org.apache.spark.examples
>>> 
>>> 
>>> Q2) 
>>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>> :31: error: object hbase is not a member of package 
>>> org.apache.hadoop
>>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>> 
>>> 
>>> 
>>> Regards
>>> Arthur
>>> 
>> 
>> 
>> 
> 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

Thanks!

patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

Still got errors.

Regards
Arthur

On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:

> spark-1297-v5.txt is level 0 patch
> 
> Please use spark-1297-v5.txt
> 
> Cheers
> 
> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> Thanks!!
> 
> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
> are good here,  but not spark-1297-v5.txt:
> 
> 
> $ patch -p1 -i spark-1297-v4.txt
> patching file examples/pom.xml
> 
> $ patch -p1 -i spark-1297-v5.txt
> can't find file to patch at input line 5
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --
> |diff --git docs/building-with-maven.md docs/building-with-maven.md
> |index 672d0ef..f8bcd2b 100644
> |--- docs/building-with-maven.md
> |+++ docs/building-with-maven.md
> --
> File to patch: 
> 
> 
> 
> 
> 
> 
> Please advise.
> Regards
> Arthur
> 
> 
> 
> On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:
> 
>> Spark examples builds against hbase 0.94 by default.
>> 
>> If you want to run against 0.98, see:
>> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
>> 
>> Cheers
>> 
>> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi, 
>> 
>> I have tried to run HBaseTest.scala, but I got the following errors; any
>> ideas on how to fix them?
>> 
>> Q1) 
>> scala> package org.apache.spark.examples
>> :1: error: illegal start of definition
>>package org.apache.spark.examples
>> 
>> 
>> Q2) 
>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>> :31: error: object hbase is not a member of package 
>> org.apache.hadoop
>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>> 
>> 
>> 
>> Regards
>> Arthur
>> 
> 
> 
> 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
spark-1297-v5.txt is level 0 patch

Please use spark-1297-v5.txt

Cheers

On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> Thanks!!
>
> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt
> are good here,  but not spark-1297-v5.txt:
>
>
> $ patch -p1 -i spark-1297-v4.txt
> patching file examples/pom.xml
>
> $ patch -p1 -i spark-1297-v5.txt
> can't find file to patch at input line 5
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --
> |diff --git docs/building-with-maven.md docs/building-with-maven.md
> |index 672d0ef..f8bcd2b 100644
> |--- docs/building-with-maven.md
> |+++ docs/building-with-maven.md
> --
> File to patch:
>
>
>
>
>
>
> Please advise.
> Regards
> Arthur
>
>
>
> On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:
>
> Spark examples builds against hbase 0.94 by default.
>
> If you want to run against 0.98, see:
> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
>
> Cheers
>
> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com <
> arthur.hk.c...@gmail.com> wrote:
>
>> Hi,
>>
>> I have tried to run HBaseTest.scala, but I got the following errors; any
>> ideas on how to fix them?
>>
>> Q1)
>> scala> package org.apache.spark.examples
>> :1: error: illegal start of definition
>>package org.apache.spark.examples
>>
>>
>> Q2)
>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>> :31: error: object hbase is not a member of package
>> org.apache.hadoop
>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>
>>
>>
>> Regards
>> Arthur
>>
>
>
>
>


Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

Thanks!!

I tried to apply the patches; both spark-1297-v2.txt and spark-1297-v4.txt are good here, but not spark-1297-v5.txt:

$ patch -p1 -i spark-1297-v4.txt
patching file examples/pom.xml

$ patch -p1 -i spark-1297-v5.txt
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git docs/building-with-maven.md docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- docs/building-with-maven.md
|+++ docs/building-with-maven.md
--
File to patch:

[The remainder of this message pasted the contents of spark-1297-v5.txt as an RTF attachment, which the archive garbled. What is recoverable from it: the patch adds a note to docs/building-with-maven.md stating that "hbase-hadoop1" is the default profile (so hbase-0.98.x-hadoop1 is used) and that the "hbase-hadoop2" profile should be specified when building against hadoop-2, and it reworks the HBase dependencies in examples/pom.xml (hbase-testing-util, hbase-protocol, hbase-common, hbase-client, hbase-server and the hbase-hadoop-compat artifacts, with various exclusions); the XML markup itself was stripped.]

Please advise.
Regards
Arthur

On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:

Spark examples builds against hbase 0.94 by default.

If you want to run against 0.98, see:
SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
Spark examples builds against hbase 0.94 by default.

If you want to run against 0.98, see:
SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297

Cheers

On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> I have tried to run HBaseTest.scala, but I got the following errors; any
> ideas on how to fix them?
>
> Q1)
> scala> package org.apache.spark.examples
> :1: error: illegal start of definition
>package org.apache.spark.examples
>
>
> Q2)
> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> :31: error: object hbase is not a member of package
> org.apache.hadoop
>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>
>
>
> Regards
> Arthur
>


Re: Broadcast error

2014-09-14 Thread Akhil Das
When the data size is huge, you are better off using the TorrentBroadcastFactory.

Thanks
Best Regards
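
As a minimal Scala sketch of that suggestion — the application name and data are assumptions, and the configuration key is the Spark 1.x one (it can equally be set in conf/spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

// Switch from the default HttpBroadcast to TorrentBroadcast for large values.
val conf = new SparkConf()
  .setAppName("LargeBroadcastExample")   // assumed name
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)

val bigData = Array.fill(1 << 20)(1.0)   // stand-in for the large matrix
val bc = sc.broadcast(bigData)
println(sc.parallelize(1 to 10).map(_ => bc.value.length).first())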

On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu  wrote:

> Specifically, the error I see when I try to operate on an RDD created by the
> sc.parallelize method is:
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
> (10485760 bytes). Consider using broadcast variables for large values.
>
> On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu 
> wrote:
>
>> Hi,
>>    I am trying to create an RDD out of a large matrix; sc.parallelize
>> suggests using broadcast for large values. But when I do
>>
>> sc.broadcast(data)
>> I get this error:
>>
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370,
>> in broadcast
>> pickled = pickleSer.dumps(value)
>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line
>> 279, in dumps
>> def dumps(self, obj): return cPickle.dumps(obj, 2)
>> SystemError: error return without exception set
>> Help?
>>
>>
>


object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, 

I have tried to run HBaseTest.scala, but I got the following errors; any ideas
on how to fix them?

Q1) 
scala> package org.apache.spark.examples
:1: error: illegal start of definition
   package org.apache.spark.examples


Q2) 
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
:31: error: object hbase is not a member of package org.apache.hadoop
   import org.apache.hadoop.hbase.mapreduce.TableInputFormat



Regards
Arthur

Re: Driver fail with out of memory exception

2014-09-14 Thread Akhil Das
Try increasing the number of partitions while doing a reduceByKey()


Thanks
Best Regards
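
A minimal sketch of that, assuming an existing SparkContext sc; the input path, the key extraction and the partition count of 400 are made-up values. The point is the second argument to reduceByKey, which controls how many partitions the shuffled output is split into:

import org.apache.spark.SparkContext._               // pair-RDD functions on pre-1.3 Spark

val pairs = sc.textFile("hdfs:///some/input")        // assumed path
  .map(line => (line.split('\t')(0), 1L))
val counts = pairs.reduceByKey(_ + _, 400)           // assumed partition count
counts.saveAsTextFile("hdfs:///some/output")         // avoid collecting huge results to the driver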

On Sun, Sep 14, 2014 at 5:11 PM, richiesgr  wrote:

> Hi
>
> I've written a job (not very complicated, I think: only one reduceByKey), but
> the driver JVM always hangs with an OOM, which of course kills the worker. How
> can I know what is running on the driver and what is running on the worker, and
> how can I debug the memory problem? I've already used the --driver-memory 4g
> parameter to give it more memory, but nothing helps; it always fails.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Driver-fail-with-out-of-memory-exception-tp14188.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Driver fail with out of memory exception

2014-09-14 Thread richiesgr
Hi

I've written a job (not very complicated, I think: only one reduceByKey), but the
driver JVM always hangs with an OOM, which of course kills the worker. How can I
know what is running on the driver and what is running on the worker, and how can
I debug the memory problem? I've already used the --driver-memory 4g parameter to
give it more memory, but nothing helps; it always fails.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Driver-fail-with-out-of-memory-exception-tp14188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



File operations on spark

2014-09-14 Thread rapelly kartheek
Hi

I am trying to perform read/write file operations in Spark by creating a
Writable object, but I am not able to write to a file. The data concerned is not
an RDD.

Can someone please tell me how to perform read/write file operations on non-RDD
data in Spark?

Regards
karthik


Re: Broadcast error

2014-09-14 Thread Chengi Liu
Specifically, the error I see when I try to operate on an RDD created by the
sc.parallelize method is:
org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.
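
Two usual ways around that limit, sketched in Scala (the frame size, application name and data below are assumptions; spark.akka.frameSize is given in MB in Spark 1.x):

import org.apache.spark.{SparkConf, SparkContext}

// Option 1: raise the Akka frame size above the 10 MB default.
val conf = new SparkConf()
  .setAppName("FrameSizeExample")          // assumed name
  .set("spark.akka.frameSize", "64")       // MB
val sc = new SparkContext(conf)

// Option 2: broadcast the large read-only data once instead of shipping it
// inside every task, and parallelize only small indices.
val matrix = Array.fill(2000, 2000)(1.0)   // stand-in for the large matrix
val bcMatrix = sc.broadcast(matrix)
val rowSums = sc.parallelize(0 until 2000, 100).map(i => bcMatrix.value(i).sum)
rowSums.take(5).foreach(println)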

On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu  wrote:

> Hi,
>    I am trying to create an RDD out of a large matrix; sc.parallelize
> suggests using broadcast for large values. But when I do
>
> sc.broadcast(data)
> I get this error:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370,
> in broadcast
> pickled = pickleSer.dumps(value)
>   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line
> 279, in dumps
> def dumps(self, obj): return cPickle.dumps(obj, 2)
> SystemError: error return without exception set
> Help?
>
>


Broadcast error

2014-09-14 Thread Chengi Liu
Hi,
   I am trying to create an RDD out of a large matrix; sc.parallelize
suggests using broadcast for large values. But when I do

sc.broadcast(data)
I get this error:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370,
in broadcast
pickled = pickleSer.dumps(value)
  File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line
279, in dumps
def dumps(self, obj): return cPickle.dumps(obj, 2)
SystemError: error return without exception set
Help?


Re: Spark SQL

2014-09-14 Thread Burak Yavuz
Hi,

I'm not a master of SparkSQL, but from what I understand, the problem is that
you're trying to access an RDD
inside an RDD here: val xyz = file.map(line => *** 
extractCurRate(sqlContext.sql("select rate ... *** and 
here:  xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate 
... ***.
RDDs can't be serialized inside other RDD tasks, therefore you're receiving the 
NullPointerException.

More specifically, you are trying to generate a SchemaRDD inside an RDD, which 
you can't do.

If file isn't huge, you can call .collect() to transform the RDD to an array 
and then use .map() on the Array.

If the file is huge, then you may do number 3 first, join the two RDDs using 
'txCurCode' as a key, and then do filtering
operations, etc...

Best,
Burak
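
A rough Scala sketch of those two alternatives, reusing file and sqlContext from the original post below; the column positions, the Double type of the rate column and the substring used as the join key are assumptions taken from that post:

import org.apache.spark.SparkContext._                 // join() on pair RDDs for pre-1.3 Spark

// (a) Small rate table: pull it to the driver once, then look rates up in the map.
val rateByCode = sqlContext
  .sql("select txCurCode, rate from CurrencyCodeRates")
  .collect()                                           // Array[Row] on the driver
  .map(r => (r.getString(0), r.getDouble(1)))
  .toMap
val withRate = file.map(line => (line, rateByCode.get(line.substring(202, 205))))

// (b) Large rate table: key both sides by currency code and join them as RDDs.
val ratesByCode = sqlContext
  .sql("select txCurCode, rate from CurrencyCodeRates")
  .map(r => (r.getString(0), r.getDouble(1)))
val linesByCode = file.map(line => (line.substring(202, 205), line))
val joined = linesByCode.join(ratesByCode)             // RDD[(code, (line, rate))]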

- Original Message -
From: "rkishore999" 
To: u...@spark.incubator.apache.org
Sent: Saturday, September 13, 2014 10:29:26 PM
Subject: Spark SQL

val file =
sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt")

1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate
from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "'
and fxCurCode = '" + fxCurCodesMap(line.substring(77,82)) + "' and
effectiveDate >= '" + line.substring(221,229) + "' order by effectiveDate
desc"))

2. val xyz = file.map(line => sqlContext.sql("select rate, txCurCode,
fxCurCode, effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and
fxCurCode = 'CSD' and effectiveDate >= '20140901' order by effectiveDate
desc"))

3. val xyz = sqlContext.sql("select rate, txCurCode, fxCurCode,
effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and fxCurCode =
'CSD' and effectiveDate >= '20140901' order by effectiveDate desc")

xyz.saveAsTextFile("/user/output")

In statements 1 and 2 I'm getting a NullPointerException, but statement 3 is
fine. I'm guessing the Spark context and the SQL context are not working together
well.

Any suggestions regarding how I can achieve this?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp14183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org