Re: org/I0Itec/zkclient/serialize/ZkSerializer ClassNotFound
You can add this jar http://central.maven.org/maven2/com/101tec/zkclient/0.3/zkclient-0.3.jar to the classpath to get rid of this. If you are hitting further exceptions like ClassNotFound for metrics* etc., then make sure you have all these jars in the classpath: SPARK_CLASSPATH=SPARK_CLASSPATH:/root/akhld/spark/lib/spark-streaming-kafka_2.10-1.1.0.jar:/root/akhld/spark/lib/kafka_2.10-0.8.0.jar:/root/akhld/spark/lib/zkclient-0.3.jar:/root/akhld/spark/lib/metrics-core-2.2.0.jar Thanks Best Regards On Tue, Oct 21, 2014 at 10:45 AM, skane sk...@websense.com wrote: I'm having the same problem with Spark 1.0.0. I got the JavaKafkaWordCount.java example working on my workstation running Spark locally after doing a build, but when I tried to get the example running on YARN, I got the same error. I used the uber jar that was created during the build process for the examples, and I confirmed that org/I0Itec/zkclient/serialize/ZkSerializer is in the uber jar. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-I0Itec-zkclient-serialize-ZkSerializer-ClassNotFound-tp15919p16897.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
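For completeness, the same dependencies can also be attached per application instead of via SPARK_CLASSPATH; a sketch with spark-submit's --jars flag (the main class and application jar below are placeholders, the library paths are the ones above):

./bin/spark-submit --class com.example.KafkaStreamingApp \
  --jars /root/akhld/spark/lib/spark-streaming-kafka_2.10-1.1.0.jar,/root/akhld/spark/lib/kafka_2.10-0.8.0.jar,/root/akhld/spark/lib/zkclient-0.3.jar,/root/akhld/spark/lib/metrics-core-2.2.0.jar \
  your-app.jar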
Re: Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode
What about start-all.sh or start-slaves.sh? Thanks Best Regards On Tue, Oct 21, 2014 at 10:25 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I'm working on a cluster where I need to start the workers separately and connect them to a master. I'm following the instructions here and using branch-1.1 http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually and I can start the master using ./sbin/start-master.sh When I try to start the slave/worker using ./sbin/start-slave.sh it doesn't work. The logs say that it needs the master. When I provide ./sbin/start-slave.sh spark://master-ip:7077 it still doesn't work. I can start the worker using the following command (as described in the documentation). ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT Was wondering why start-slave.sh is not working? Thanks -Soumya
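For reference, a sketch of the two usual ways to bring workers up in standalone mode (if I remember the 1.1-era scripts correctly, start-slave.sh expects a worker number before the master URL, which would explain why passing only the URL fails; host names are illustrative):

# on each worker machine
./sbin/start-slave.sh 1 spark://master-ip:7077
# or, from the master, after listing worker hostnames in conf/slaves
./sbin/start-slaves.sh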
Re: default parallelism bug?
Hi, what do you mean by pretty small? How big is your file? Regards, Olivier. 2014-10-21 6:01 GMT+02:00 Kevin Jung itsjb.j...@samsung.com: I use Spark 1.1.0 and set these options in spark-defaults.conf spark.scheduler.mode FAIR spark.cores.max 48 spark.default.parallelism 72 Thanks, Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/default-parallelism-bug-tp16787p16894.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Convert Iterable to RDD
I don't think this is provided out of the box, but you can use toSeq on your Iterable and if the Iterable is lazy, it should stay that way for the Seq. And then you can use sc.parallelize(myIterable.toSeq) so you'll have your RDD. For the Iterable[Iterable[T]] you can flatten it and then create your RDD from the corresponding Iterable. Regards, Olivier. 2014-10-21 5:07 GMT+02:00 Dai, Kevin yun...@ebay.com: In addition, how to convert Iterable[Iterable[T]] to RDD[T]? Thanks, Kevin. From: Dai, Kevin [mailto:yun...@ebay.com] Sent: October 21, 2014 10:58 To: user@spark.apache.org Subject: Convert Iterable to RDD Hi, All Is there any way to convert an Iterable to an RDD? Thanks, Kevin.
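A minimal sketch of both conversions (assuming the data is small enough to hand to parallelize on the driver; the values are illustrative):

import org.apache.spark.rdd.RDD
val it: Iterable[Int] = Seq(1, 2, 3)
val rdd: RDD[Int] = sc.parallelize(it.toSeq)

val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3))
val flat: RDD[Int] = sc.parallelize(nested.flatten.toSeq)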
Re: RDD to Multiple Tables SparkSQL
If you already know your keys the best way would be to extract one RDD per key (it would not bring the content back to the master and you can take advantage of the caching features) and then execute a registerTempTable per key. But I'm guessing you don't know the keys in advance, and in that case I think it becomes very confusing to put everything in different tables. First of all, how would you query it afterwards? Regards, Olivier. 2014-10-20 13:02 GMT+02:00 critikaled isasmani@gmail.com: Hi I have an RDD which I want to register as multiple tables based on key val context = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(context) import sqlContext.createSchemaRDD case class KV(key: String, id: String, value: String) val logsRDD = context.textFile("logs", 10).map { line => val Array(key, id, value) = line split ' '; (key, id, value) }.registerTempTable("KVS") I want to store the above information into multiple tables based on key without bringing the entire data to the master. Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
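A rough sketch of the per-key approach Olivier describes (assuming the set of keys is known up front; the key values and table names are purely illustrative, and the implicit createSchemaRDD import from the original snippet is what lets registerTempTable be called on the filtered RDDs):

case class KV(key: String, id: String, value: String)
val kvs = context.textFile("logs", 10).map { line =>
  val Array(key, id, value) = line.split(' ')
  KV(key, id, value)
}
// one temp table per known key; each filter is a separate lazily-evaluated RDD
val knownKeys = Seq("keyA", "keyB")  // hypothetical
knownKeys.foreach { k =>
  kvs.filter(_.key == k).registerTempTable("KVS_" + k)
}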
Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Hi All, I am trying to run the spark example JavaDecisionTree code using some external data set. It works for certain dataset only with specific maxBins and maxDepth settings. Even for a working dataset if I add a new data item I get a ArrayIndexOutOfBounds Exception, I get the same exception for the first case as well (changing maxBins and maxDepth). I am not sure what is wrong here, can anyone please explain this. Exception stacktrace: 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage 7.0 (TID 13) java.lang.ArrayIndexOutOfBoundsException: 6301 at org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) at org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) at org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0 (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301 org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
Re: What does KryoException: java.lang.NegativeArraySizeException mean?
Thanks, Guillaume. Below is when the exception happens; nothing has spilled to disk yet. And there isn't a join, but a partitionBy and groupBy action. Actually if numPartitions is small, it succeeds, while if it's large, it fails. Partitioning was simply done by: override def getPartition(key: Any): Int = { (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions } From the stage UI, two attempts of task index 99 (TIDs 1730 and 1751) failed on executor gs-server-1000 (PROCESS_LOCAL; launched 2014/10/21 17:29:56 and 17:31:29; durations 1.6 min and 2.6 min; GC time 30 s and 39 s; shuffle read 43.6 MB and 42.7 MB; shuffle spill 0.0 B in memory and 0.0 B on disk), both with the same error: com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException Serialization trace: values (org.apache.spark.sql.catalyst.expressions.SpecificMutableRow) otherElements (org.apache.spark.util.collection.CompactBuffer) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585) com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38) com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:203) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:90) org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:89) org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:625) org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:625) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745)
[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?
Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full stacktrace below): java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer Is this function supposed to be available? Thanks P. --- java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer at org.apache.spark.sql.hive.HiveUdafFunction.init(hiveUdfs.scala:334) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207) at org.apache.spark.sql.execution.Aggregate.org$apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Getting Spark SQL talking to Sql Server
Hi, Is there a simple way to run Spark SQL queries against SQL Server databases? Or are we limited to running SQL and doing sc.parallelize()? Being able to query small amounts of lookup info directly from Spark can save a bunch of annoying ETL, and I'd expect Spark SQL to have some way of doing this. Cheers, Ashic.
Custom s3 endpoint
I have an s3-compatible service and I'd like to have access to it in Spark. From what I have gathered, I need to add s3service.s3-endpoint=my_s3_endpoint to a jets3t.properties file on the classpath. I'm not a Java programmer and I'm not sure where to put it in a hello-world example. I managed to make it work with a local master with this hack: Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME).setProperty("s3service.s3-endpoint", my_s3_endpoint); But this property fails to propagate when I run Spark on a Mesos cluster. Putting a correct jets3t.properties in SPARK_HOME/conf also helps only in local master mode. Can anyone help with this issue? Where should I put my jets3t.properties in a Java project? That would be super-awesome if Spark could pick up the s3 endpoint from env variables like it does with s3 credentials. Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-s3-endpoint-tp16911.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Getting Spark SQL talking to Sql Server
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL server. Currently Spark SQL can't run queries against SQL server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark sql queries against Sql Server databases? Or are we limited to running sql and doing sc.Parallelize()? Being able to query small amounts of lookup info directly from spark can save a bunch of annoying etl, and I'd expect Spark Sql to have some way of doing this. Cheers, Ashic.
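For anyone looking for the shape of that, a minimal JdbcRDD sketch (the connection string, table and bounds are illustrative; the two ?s in the query are mandatory and get filled with each partition's bounds, and the Microsoft JDBC driver jar has to be on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=me;password=secret"  // hypothetical
val lookup = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url),
  "SELECT id, name FROM lookup_table WHERE id >= ? AND id <= ?",
  1, 1000, 2,  // lower bound, upper bound, number of partitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))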
create a Row Matrix
Hi, I am VERY new to spark and mllib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question but could somebody please give me an example - how to create a RowMatrix in scala with the following entries: [1 2 3 4]? I would like to apply an SVD on it. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
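Not an authoritative answer, but here is a minimal sketch assuming the four numbers are meant as a single row (for a multi-row matrix, just parallelize one Vector per row):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0, 4.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(mat.numCols().toInt, computeU = true)
// svd.s holds the singular values, svd.U and svd.V the factors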
Re: spark sql: join sql fails after sqlCtx.cacheTable()
val sqlContext = new org.apache.spark.sql.SQLContext(sc) val personPath = "/hdd/spark/person.json" val person = sqlContext.jsonFile(personPath) person.printSchema() person.registerTempTable("person") val addressPath = "/hdd/spark/address.json" val address = sqlContext.jsonFile(addressPath) address.printSchema() address.registerTempTable("address") sqlContext.cacheTable("person") sqlContext.cacheTable("address") val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address a where p.id = a.id limit 10").collect.foreach(println) person.json: {"id":1,"name":"Mr. X"} address.json: {"city":"Earth","id":1} -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: why fetch failed
thank you, it works! The akka timeout may be the bottleneck in my system. On Oct 20, 2014, at 17:07, Akhil Das ak...@sigmoidanalytics.com wrote: I used to hit this issue when my data size was too large and the number of partitions was too large (> 1200). I got rid of it by - Reducing the number of partitions - Setting the following while creating the sparkContext: .set("spark.rdd.compress","true") .set("spark.storage.memoryFraction","1") .set("spark.core.connection.ack.wait.timeout","600") .set("spark.akka.frameSize","50") Thanks Best Regards On Sun, Oct 19, 2014 at 6:52 AM, marylucy qaz163wsx_...@hotmail.com wrote: When doing groupby for big data, maybe 500g, some partition tasks succeed, some partition tasks get fetchfailed errors. Spark system retries the previous stage, but always fails. 6 computers: 384g Worker: 40g*7 for one computer Can anyone tell me why fetch failed??? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: why fetch failed
thanks, I need to check whether Spark 1.1.0 contains it. On Oct 21, 2014, at 0:01, DB Tsai dbt...@dbtsai.com wrote: I ran into the same issue when the dataset is very big. Marcelo from Cloudera found that it may be caused by SPARK-2711, so their Spark 1.1 release reverted SPARK-2711, and the issue is gone. See https://issues.apache.org/jira/browse/SPARK-3633 for detail. You can check out Cloudera's version here https://github.com/cloudera/spark/tree/cdh5-1.1.0_5.2.0 PS, I haven't tested it yet, but will test it in the coming couple of days and report back. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Oct 18, 2014 at 6:22 PM, marylucy qaz163wsx_...@hotmail.com wrote: When doing groupby for big data, maybe 500g, some partition tasks succeed, some partition tasks get fetchfailed errors. Spark system retries the previous stage, but always fails. 6 computers: 384g Worker: 40g*7 for one computer Can anyone tell me why fetch failed??? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Getting Spark SQL talking to Sql Server
Thanks. Didn't know about jdbcrdd...should do nicely for now. The foreign data source api looks interesting... Date: Tue, 21 Oct 2014 20:33:03 +0800 From: lian.cs@gmail.com To: as...@live.com; user@spark.apache.org Subject: Re: Getting Spark SQL talking to Sql Server Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL server. Currently Spark SQL can't run queries against SQL server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark sql queries against Sql Server databases? Or are we limited to running sql and doing sc.Parallelize()? Being able to query small amounts of lookup info directly from spark can save a bunch of annoying etl, and I'd expect Spark Sql to have some way of doing this. Cheers, Ashic.
Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
That's true Guillaume. I'm currently aggregating documents considering a week as time range. I will have to make it daily and aggregate the results later. thanks for your hints anyway Arian Pasquali http://about.me/arianpasquali 2014-10-20 13:53 GMT+01:00 Guillaume Pitel guillaume.pi...@exensa.com: Hi, The array size you (or the serializer) tries to allocate is just too big for the JVM. No configuration can help : https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit The only option is to split you problem further by increasing parallelism. Guillaume Hi, I’m using Spark 1.1.0 and I’m having some issues to setup memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration https://spark.apache.org/docs/1.1.0/configuration.html. My server has 30G of memory and this are my current settings. ##this one seams that was deprecated export SPARK_MEM=‘25g’ ## worker memory options seams to be the memory for each worker (by default we have a worker for each core) export SPARK_WORKER_MEMORY=‘5g’ I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but I’m not quite sure how. I have tried some different options like the following, but I still couldn’t make it right: export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' Does anyone has any idea how can I approach this? 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far) 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far) 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566) java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140 Arian -- Guillaume PITEL, Président +33(0)626 222 431 eXenSa S.A.S. http://www.exensa.com/ 41, rue Périer - 92120 Montrouge - FRANCE Tel +33(0)184 163 677 / Fax +33(0)972 283 705
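For readers hitting the same wall, a sketch of what "increasing parallelism" can look like in practice (the RDD name and partition counts are purely illustrative; the point is just to keep any single task's buffer small enough to serialize):

// docs stands in for the pair RDD being aggregated (hypothetical)
val grouped = docs.groupByKey(numPartitions = 2000)  // more, smaller shuffle partitions
val spread = docs.repartition(2000)                  // or repartition before the heavy stage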
Re: Streams: How do RDDs get Aggregated?
Hi Spark! I found out why my RDDs weren't coming through in my spark stream. It turns out onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example def onStartMock(): Unit = { val future = new Thread(new Runnable() { def run() { for (x <- 1 until 10) { val newMem = Runtime.getRuntime.freeMemory() / 12188091; if (newMem != lastMem) { System.out.println("in thread : " + newMem); } lastMem = newMem; store(mockStatus); } } }); future.start() } Hope that helps somebody in the same situation. FYI it's in the docs :) * {{{ * class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) { * def onStart() { * // Setup stuff (start threads, open sockets, etc.) to start receiving data. * // Must start new thread to receive data, as onStart() must be non-blocking. * * // Call store(...) in those threads to store received data into Spark's memory. * * // Call stop(...), restart(...) or reportError(...) on any thread based on how * // different errors needs to be handled. * * // See corresponding method documentation for more details * } * * def onStop() { * // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data. * } * } * }}}
Re: How do you write a JavaRDD into a single file
Collect will store the entire output in a List in memory. This solution is acceptable for Little Data problems, although if the entire problem fits in the memory of a single machine there is less motivation to use Spark. Most problems which benefit from Spark are large enough that even the data assigned to a single partition will not fit into memory. In my special case the output is now in the 0.5 - 4 GB range but in the future might get to 4 times that size - something a single machine could write but not hold at one time. I find that for most problems a file like Part-0001 is not what the next step wants to use - the minute a step is required to further process that file, even to move and rename it, there is little reason not to let the Spark code write what is wanted in the first place. I like the solution of using toLocalIterator and writing my own file.
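For concreteness, a minimal sketch of that toLocalIterator approach (assuming the rows are strings and the destination is a path local to the driver; the path is illustrative):

import java.io.PrintWriter
val out = new PrintWriter("/data/output.txt")  // hypothetical destination
try {
  // toLocalIterator pulls one partition at a time, so the driver never holds the whole dataset
  rdd.toLocalIterator.foreach(line => out.println(line))
} finally {
  out.close()
}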
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Hi Tridib, I changed SQLContext to HiveContext and it started working. These are the steps I used. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val person = sqlContext.jsonFile("json/person.json") person.printSchema() person.registerTempTable("person") val address = sqlContext.jsonFile("json/address.json") address.printSchema() address.registerTempTable("address") sqlContext.cacheTable("person") sqlContext.cacheTable("address") val rs2 = sqlContext.sql("select p.id, p.name, a.city from person p join address a on (p.id = a.id)").collect.foreach(println) Rishi@InfoObjects Pure-play Big Data Consulting On Tue, Oct 21, 2014 at 5:47 AM, tridib tridib.sama...@live.com wrote: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val personPath = "/hdd/spark/person.json" val person = sqlContext.jsonFile(personPath) person.printSchema() person.registerTempTable("person") val addressPath = "/hdd/spark/address.json" val address = sqlContext.jsonFile(addressPath) address.printSchema() address.registerTempTable("address") sqlContext.cacheTable("person") sqlContext.cacheTable("address") val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address a where p.id = a.id limit 10").collect.foreach(println) person.json: {"id":1,"name":"Mr. X"} address.json: {"city":"Earth","id":1} -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Hmm... I thought HiveContext will only work if Hive is present. I am curious to know when to use HiveContext and when to use SQLContext. Thanks Regards Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16924.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Cassandra connector issue
Hi, I am creating a Cassandra Java RDD and transforming it using the where clause. It works fine when I run it outside the mapValues, but when I put the code in mapValues I get an error while creating the transformation. Below is my sample code: CassandraJavaRDD<ReferenceData> cassandraRefTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class); JavaPairRDD<String, Employee> joinedRdd = rdd.mapValues(new Function<IPLocation, IPLocation>() { public Employee call(Employee employee) throws Exception { ReferenceData data = null; if (employee.getDepartment() != null) { data = referenceTable.where("postal_plus=?", location.getPostalPlus()).first(); System.out.println(data.toCSV()); } if (data != null) { // call setters on employee } return employee; } } I get this error: java.lang.NullPointerException at org.apache.spark.rdd.RDD.<init>(RDD.scala:125) at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:47) at com.datastax.spark.connector.rdd.CassandraRDD.copy(CassandraRDD.scala:70) at com.datastax.spark.connector.rdd.CassandraRDD.where(CassandraRDD.scala:77) at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(CassandraJavaRDD.java:54) Thanks for help!! Regards Ankur
Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?
No, analytic and window functions do not work yet. On Tue, Oct 21, 2014 at 3:00 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full stacktrace below): java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer Is this function supposed to be available? Thanks P. --- java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer at org.apache.spark.sql.hive.HiveUdafFunction.init(hiveUdfs.scala:334) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207) at org.apache.spark.sql.execution.Aggregate.org $apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
disk-backing pyspark rdds?
Hi All! I'm getting my feet wet with pySpark for the fairly boring case of doing parameter sweeps for monte carlo runs. Each of my functions runs for a very long time (2h+) and returns numpy arrays on the order of ~100 MB. That is, my spark applications look like def foo(x): np.random.seed(x); eat_2GB_of_ram(); take_2h(); return my_100MB_array followed by sc.parallelize(np.arange(100)).map(foo).saveAsPickleFile("s3n://blah...") The resulting rdds will most likely not fit in memory but for this use case I don't really care. I know I can persist RDDs, but is there any way to by-default disk-back them (something analogous to mmap?) so that they don't create memory pressure in the system at all? With compute taking this long, the added overhead of disk and network IO is quite minimal. Thanks! ...Eric Jonas
stage failure: Task 0 in stage 0.0 failed 4 times
what could cause this type of 'stage failure'? Thanks! This is a simple py spark script to list data in hbase. command line: ./spark-submit --driver-class-path ~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py 14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-***.ec2.internal:35201 (size: 1470.0 B, free: 265.4 MB) 14/10/21 17:53:50 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0 14/10/21 17:53:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[1] at map at PythonHadoopUtil.scala:185) 14/10/21 17:53:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:34050/user/Executor#681287499] with ID 0 14/10/21 17:53:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:47483/user/Executor#-936252397] with ID 1 14/10/21 17:53:53 INFO BlockManagerMasterActor: Registering block manager ip-2.internal:49236 with 3.1 GB RAM 14/10/21 17:53:54 INFO BlockManagerMasterActor: Registering block manager ip-.ec2.internal:36699 with 3.1 GB RAM 14/10/21 17:53:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-.ec2.internal): java.lang.IllegalStateException: unread block data java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor ip-.internal: java.lang.IllegalStateException (unread block data) [duplicate 1] 14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor ip-.internal: java.lang.IllegalStateException (unread block data) [duplicate 2] 14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor ip-2.internal: java.lang.IllegalStateException (unread block data) [duplicate 3] 14/10/21 17:53:54 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 14/10/21 17:53:54 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/10/21 17:53:54 INFO TaskSchedulerImpl: Cancelling stage 0 14/10/21 17:53:54 INFO DAGScheduler: Failed to run 
first at SerDeUtil.scala:70 Traceback (most recent call last): File /root/workspace/test/sparkhbase.py, line 17, in module conf=conf2) File /root/spark/python/pyspark/context.py, line 471, in newAPIHadoopRDD jconf, batchSize) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-internal): java.lang.IllegalStateException: unread block data java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
Re: stage failure: Task 0 in stage 0.0 failed 4 times
maybe set up a hbase.jar in the conf? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-Task-0-in-stage-0-0-failed-4-times-tp16928p16929.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL - TreeNodeException for unresolved attributes
Just to follow up, the queries worked against master and I got my whole flow rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with the next release of CDH5 :P -Terry From: Terry Siu terry@smartfocus.com Date: Monday, October 20, 2014 at 12:22 PM To: Michael Armbrust mich...@databricks.com Cc: user@spark.apache.org Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Hi Michael, Thanks again for the reply. Was hoping it was something I was doing wrong in 1.1.0, but I’ll try master. Thanks, -Terry From: Michael Armbrust mich...@databricks.com Date: Monday, October 20, 2014 at 12:11 PM To: Terry Siu terry@smartfocus.com Cc: user@spark.apache.org Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Have you tried this on master? There were several problems with resolution of complex queries that were registered as tables in the 1.1.0 release. On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu terry@smartfocus.com wrote: Hi all, I’m getting a TreeNodeException for unresolved attributes when I do a simple select from a schemaRDD generated by a join in Spark 1.1.0. A little background first. I am using a HiveContext (against Hive 0.12) to grab two tables, join them, and then perform multiple INSERT-SELECT with GROUP BY to write back out to a Hive rollup table that has two partitions. This task is an effort to simulate the unsupported GROUPING SETS functionality in SparkSQL. In my first attempt, I got really close using SchemaRDD.groupBy until I realized that SchemaRDD.insertTo API does not support partitioned tables yet. This prompted my second attempt to pass in SQL to the HiveContext.sql API instead.
Here’s a rundown of the commands I executed on the spark-shell: val hc = new HiveContext(sc) hc.setConf(spark.sql.hive.convertMetastoreParquet, true”) hc.setConf(spark.sql.parquet.compression.codec, snappy”) // For implicit conversions to Expression val sqlContext = new SQLContext(sc) import sqlContext._ val segCusts = hc.hql(“select …”) val segTxns = hc.hql(“select …”) val sc = segCusts.as('sc) val st = segTxns.as(‘st) // Join the segCusts and segTxns tables val rup = sc.join(st, Inner, Some(sc.segcustomerid.attr===st.customerid.attr)) rup.registerAsTable(“rupbrand”) If I do a printSchema on the rup, I get: root |-- segcustomerid: string (nullable = true) |-- sales: double (nullable = false) |-- tx_count: long (nullable = false) |-- storeid: string (nullable = true) |-- transdate: long (nullable = true) |-- transdate_ts: string (nullable = true) |-- transdate_dt: string (nullable = true) |-- unitprice: double (nullable = true) |-- translineitem: string (nullable = true) |-- offerid: string (nullable = true) |-- customerid: string (nullable = true) |-- customerkey: string (nullable = true) |-- sku: string (nullable = true) |-- quantity: double (nullable = true) |-- returnquantity: double (nullable = true) |-- channel: string (nullable = true) |-- unitcost: double (nullable = true) |-- transid: string (nullable = true) |-- productid: string (nullable = true) |-- id: string (nullable = true) |-- campaign_campaigncost: double (nullable = true) |-- campaign_begindate: long (nullable = true) |-- campaign_begindate_ts: string (nullable = true) |-- campaign_begindate_dt: string (nullable = true) |-- campaign_enddate: long (nullable = true) |-- campaign_enddate_ts: string (nullable = true) |-- campaign_enddate_dt: string (nullable = true) |-- campaign_campaigntitle: string (nullable = true) |-- campaign_campaignname: string (nullable = true) |-- campaign_id: string (nullable = true) |-- product_categoryid: string (nullable = true) |-- product_company: string (nullable = true) |-- product_brandname: string (nullable = true) |-- product_vendorid: string (nullable = true) |-- product_color: string (nullable = true) |-- product_brandid: string (nullable = true) |-- product_description: string (nullable = true) |-- product_size: string (nullable = true) |-- product_subcategoryid: string (nullable = true) |-- product_departmentid: string (nullable = true) |-- product_productname: string (nullable = true) |-- product_categoryname: string (nullable = true) |-- product_vendorname: string (nullable = true) |-- product_sku: string (nullable = true) |-- product_subcategoryname: string (nullable = true) |-- product_status: string (nullable = true) |-- product_departmentname: string (nullable = true) |-- product_style: string (nullable = true) |-- product_id: string (nullable = true) |-- customer_lastname: string (nullable =
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Hmm... I thought HiveContext will only worki if Hive is present. I am curious to know when to use HiveContext and when to use SqlContext. http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started TLDR; Always use HiveContext if your application does not have a dependency conflict with Hive jars. :)
How to set hadoop native library path in spark-1.1
Hi all, Can anyone tell me how to set the hadoop native library path in Spark? Right now I am setting it using the SPARK_LIBRARY_PATH environment variable in spark-env.sh. But still no success. I am still seeing this in spark-shell: NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Thanks, Pradeep
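Not sure it is the blessed way, but one thing worth trying is the extra library path settings in spark-defaults.conf (the directory below is illustrative; point it at wherever the Hadoop native libs live on your nodes):

spark.driver.extraLibraryPath   /usr/lib/hadoop/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native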
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Thanks for pointing that out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16933.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Streams: How do RDDs get Aggregated?
Oh - and one other note on this, which appears to be the case: if, in your stream's foreachRDD implementation, you do something stupid (like call rdd.count()), you can get stuck in a situation where your RDD processor blocks infinitely. tweetStream.foreachRDD((rdd, lent) => { tweetStream.repartition(1) numTweetsCollected += 1; // val count = rdd.count() DON'T DO THIS ! And for Twitter-specific stuff, make sure to look at modifying the TwitterInputDStream class so that it implements the stuff from SPARK-2464; otherwise you can hit infinite stream reopening as well. On Tue, Oct 21, 2014 at 11:02 AM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Spark! I found out why my RDDs weren't coming through in my spark stream. It turns out onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example def onStartMock(): Unit = { val future = new Thread(new Runnable() { def run() { for (x <- 1 until 10) { val newMem = Runtime.getRuntime.freeMemory() / 12188091; if (newMem != lastMem) { System.out.println("in thread : " + newMem); } lastMem = newMem; store(mockStatus); } } }); future.start() } Hope that helps somebody in the same situation. FYI it's in the docs :) * {{{ * class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) { * def onStart() { * // Setup stuff (start threads, open sockets, etc.) to start receiving data. * // Must start new thread to receive data, as onStart() must be non-blocking. * * // Call store(...) in those threads to store received data into Spark's memory. * * // Call stop(...), restart(...) or reportError(...) on any thread based on how * // different errors needs to be handled. * * // See corresponding method documentation for more details * } * * def onStop() { * // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data. * } * } * }}} -- jay vyas
How to calculate percentiles with Spark?
Hi, What would be the best way to get percentiles from a Spark RDD? I can see JavaDoubleRDD or MLlib's MultivariateStatisticalSummary https://spark.apache.org/docs/latest/mllib-statistics.html provide the mean() but not percentiles. Thank you! Horace -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentiles-with-Spark-tp16937.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
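One way to get an exact percentile with plain RDD operations is sketched below (it costs a full sort plus a lookup; the input data is illustrative, and for huge datasets sampling first is a common shortcut):

val values = sc.parallelize(Seq(1.0, 5.0, 2.0, 9.0, 3.0))  // hypothetical data
val indexed = values.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val n = indexed.count()
def percentile(p: Double): Double = {
  val rank = math.max(math.ceil(p / 100.0 * n).toLong - 1, 0L)
  indexed.lookup(rank).head
}
val median = percentile(50)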
Spark-Submit Python along with JAR
Hi, I'd like to run my python script using spark-submit together with a JAR file containing Java specifications for a Hadoop file system. How can I do that? It seems I can either provide a JAR file or a Python file to spark-submit. So far I have been running my code in ipython with IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark --jars /usr/local/spark/HadoopFileFormat.jar Best, Tassilo -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Submit-Python-along-with-JAR-tp16938.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
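For what it's worth, spark-submit accepts --jars alongside a Python file; a sketch along the lines of the ipython invocation above (the script name is a placeholder):

/usr/local/spark/bin/spark-submit --jars /usr/local/spark/HadoopFileFormat.jar your_script.py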
spark sql: sqlContext.jsonFile date type detection and perforormance
Any help? or comments? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Primitive arrays in Spark
This is as much of a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse as follows: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b) If I start with an Array of primitive longs in rdd1, will rdd2 also have Arrays of primitive longs? I suspect, based on my memory usage, that this is not the case. Also, would it be more efficient to do this: val rdd1: RDD[(Long, ArrayBuffer[Long])] and then val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b).mapValues(_.toArray)
MLLib libsvm format
Hi All, I have a question regarding the ordering of indices. The documentation says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data: It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based. For example, I have indices ranging from 1 to 1000. Is this OK as a libsvm data file?
1 110:1.0 80:0.5 310:0.00 890:0.5
2 0:0.0 200:0.5 400:1.0 82:0.0
and so on, OR do I need to sort them as:
1 80:0.5 110:1.0 310:0.00 890:0.5
2 0:0.0 82:0.0 200:0.5 400:1.0
Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?
Ok thanks Michael. In general, what's the easy way to figure out what's already implemented? The exception I was getting was not really helpful here? Also, is there a roadmap document somewhere ? Thanks! P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909p16942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
Thanks for the help! Hadoop version: 2.3.0 HBase version: 0.98.1 Use python to read/write data from/to HBase. The only change over the official Spark 1.1.0 is the pom file under examples. Compilation: spark: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package spark/examples: mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package I am wondering how I can deploy this version of Spark to a new ec2 cluster. I tried ./spark-ec2 -k sparkcluster -i ~/sparkcluster.pem -s 1 -v 1.1.0 --hadoop-major-version=2.3.0 --worker-instances=2 -z us-east-1d launch sparktest1 but this version got a type mismatch error when I read HBase data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to calculate percentiles with Spark?
A rather more general question is - assume I have a JavaRDD<K> which is sorted. How can I convert this into a JavaPairRDD<Integer, K> where the Integer is the index 0...N-1? Easy to do on one machine: JavaRDD<K> values = ... // create here JavaPairRDD<Integer, K> positions = values.mapToPair(new PairFunction<K, Integer, K>() { private int index = 0; @Override public Tuple2<Integer, K> call(final K t) throws Exception { return new Tuple2(index++, t); } }); but will this code do the right thing on a cluster? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentiles-with-Spark-tp16937p16945.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
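For what it's worth, a sketch of the same thing with the built-in zipWithIndex, which assigns consistent 0...N-1 indices across partitions on a cluster (shown in Scala for brevity; JavaRDD has an equivalent zipWithIndex method):

val values = sc.parallelize(Seq("a", "b", "c"))                      // hypothetical sorted RDD
val positions = values.zipWithIndex().map { case (v, i) => (i, v) }  // (index, value) pairs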
Re: Class not found
maven cache is laid out differently but it does work on Linux and BSD/mac. Still looks like a hack to me. On Oct 21, 2014, at 1:28 PM, Pat Ferrel p...@occamsmachete.com wrote: Doesn’t this seem like a dangerous, error-prone hack? It will build different bits on different machines. It doesn’t even work on my linux box because the mvn install doesn’t cache the same as on the mac. If Spark is going to be supported on the maven repos shouldn’t it be addressed by different artifacts to support any option that changes the linkage info/class naming? On Oct 21, 2014, at 12:16 PM, Pat Ferrel p...@occamsmachete.com wrote: Not sure if this has been clearly explained here but since I took a day to track it down… Several people have experienced a class not found error on Spark when the class referenced is supposed to be in the Spark jars. One thing that can cause this is if you are building Spark for your cluster environment. The instructions say to do a “mvn package …”. Instead, some of these errors can be fixed using the following procedure: 1) delete ~/.m2/repository/org/apache/spark and your-project 2) build Spark for your version of Hadoop, but do not use “mvn package …” - use “mvn install …”. This will put a copy of the exact bits you need into the maven cache for building your-project against. In my case using hadoop 1.2.1 it was “mvn -Dhadoop.version=1.2.1 -DskipTests clean install”. If you run tests on Spark some failures can safely be ignored, so check before giving up. 3) build your-project with “mvn clean install”
com.esotericsoftware.kryo.KryoException: Buffer overflow.
I am running a simple rdd filter command. What does it mean? Here is the full stack trace (and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420) at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) *Here is the code of the main function:* String comparisonFieldIndexes = "16,18"; String segmentFieldIndexes = "14,15"; String comparisonFieldWeights = "50, 50"; String delimiter = "" + '\001'; PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70, comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes, delimiter); JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER)); parOnCol.printRDD(filtered_rdd); *Here is the FilterEmptyFields class:* public class FilterEmptyFields implements Function<String, Boolean> { final int[] nonEmptyFields; final String DELIMITER; public FilterEmptyFields(int[] nonEmptyFields, String delimiter){ this.nonEmptyFields = nonEmptyFields; this.DELIMITER = delimiter; } @Override public Boolean call(String s){ String[] fields = s.split(DELIMITER); for(int i = 0; i < nonEmptyFields.length; i++){ if(fields[nonEmptyFields[i]] == null || fields[nonEmptyFields[i]].isEmpty()){ return false; } } return true; } } Any suggestions guys? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SchemaRDD.where clause error
Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql("select * from people") people.where('age === 10) <console>:27: error: value === is not a member of Symbol Where did I go wrong? Thanks, Kevin Paul
buffer overflow when running Kmeans
this is the stack trace I got with yarn logs -applicationId really no idea where to dig further. thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98 com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 3, required: 8 Serialization trace: data$mcD$sp (breeze.linalg.DenseVector$mcD$sp) at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477) at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200) at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:142) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
Re: SchemaRDD.where clause error
You need to import sqlCtx._ to get access to the implicit conversion. On Tue, Oct 21, 2014 at 2:40 PM, Kevin Paul kevinpaulap...@gmail.com wrote: Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql("select * from people") people.where('age === 10) <console>:27: error: value === is not a member of Symbol Where did I go wrong? Thanks, Kevin Paul
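A minimal sketch of the suggested fix, assuming sc is the shell's SparkContext and that a "people" table has already been registered as in the original question; importing sqlCtx._ brings the implicit that turns 'age into a column expression into scope.

import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
import sqlCtx._  // implicit conversions, including Symbol -> attribute
val people = sqlCtx.sql("select * from people")
people.where('age === 10).collect().foreach(println)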
Re: buffer overflow when running Kmeans
Just posted below for a similar question. Have you seen this thread ? http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflowsubj=RE+spark+nbsp+kryo+serilizable+nbsp+exception On Tue, Oct 21, 2014 at 2:44 PM, Yang tedd...@gmail.com wrote: this is the stack trace I got with yarn logs -applicationId really no idea where to dig further. thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98 com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 3, required: 8 Serialization trace: data$mcD$sp (breeze.linalg.DenseVector$mcD$sp) at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477) at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200) at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:142) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
How to read BZ2 XML file in Spark?
Hi, I want to ingest OpenStreetMap. It's 43GB (compressed) XML in BZIP2 format. What's your advice for reading it into an RDD? BTW, the Spark Training at UMD is awesome! I'm having a blast learning Spark. I wish I could go to the MeetUp tonight, but I have kid activities... http://wiki.openstreetmap.org/wiki/Planet.osm http://wiki.openstreetmap.org/wiki/OSM_XML John -- John S. Roberts SigInt Technologies LLC, a Novetta Solutions Company 8830 Stanford Blvd, Suite 306; Columbia, MD 21045 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql: sqlContext.jsonFile date type detection and performance
Are there any specific issues you are facing? Thanks, Yin On Tue, Oct 21, 2014 at 4:00 PM, tridib tridib.sama...@live.com wrote: Any help? or comments? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib libsvm format
Yes. "where the indices are one-based and **in ascending order**." -Xiangrui On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based. For example, I have indices ranging from 1 to 1000; is this OK as a libsvm data file? 1 110:1.0 80:0.5 310:0.0 0 890:0.5 20:0.0 200:0.5 400:1.0 82:0.0 and so on: OR do I need to sort them as: 1 80:0.5 110:1.0 310:0.0 0 20:0.0 82:0.0 200:0.5 400:1.0 890:0.5 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
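For completeness, a small hedged sketch of loading such a file with MLlib once the rows are sorted; the path is a placeholder and sc is the shell's SparkContext. loadLibSVMFile converts the one-based indices to zero-based on load, as the quoted documentation notes.

import org.apache.spark.mllib.util.MLUtils
val examples = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
examples.take(1).foreach(println)  // LabeledPoint(label, sparse feature vector)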
spark ui redirecting to port 8100
I set the Spark port to a different one and the connection seems successful, but I get a 302 redirect to /proxy on port 8100. Nothing is listening on that port either. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: create a Row Matrix
Please check out the example code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala -Xiangrui On Tue, Oct 21, 2014 at 5:34 AM, viola viola.wiersc...@siemens.com wrote: Hi, I am VERY new to spark and mllib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question but could somebody please give me an example - how to create a RowMatrix in scala with the following entries: [1 2 3 4]? I would like to apply an SVD on it. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: MLLib libsvm format
Great, I will sort them. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone -------- Original message -------- From: Xiangrui Meng men...@gmail.com Date: 10/21/2014 3:29 PM (GMT-08:00) To: Sameer Tilak ssti...@live.com Cc: user@spark.apache.org Subject: Re: MLLib libsvm format Yes. "where the indices are one-based and **in ascending order**." -Xiangrui On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based. For example, I have indices ranging from 1 to 1000; is this OK as a libsvm data file? 1 110:1.0 80:0.5 310:0.0 0 890:0.5 20:0.0 200:0.5 400:1.0 82:0.0 and so on: OR do I need to sort them as: 1 80:0.5 110:1.0 310:0.0 0 20:0.0 82:0.0 200:0.5 400:1.0 890:0.5
Re: How to read BZ2 XML file in Spark?
Hi John, Glad you're enjoying the Spark training at UMD. Is the 43 GB XML data in a single file or split across multiple BZIP2 files? Is the file in a HDFS cluster or on a single linux machine? If you're using BZIP2 with splittable compression (in HDFS), you'll need at least Hadoop 1.1: https://issues.apache.org/jira/browse/HADOOP-7823 Or if you've got the file on a single linux machine, perhaps consider uncompressing it manually using cmd line tools before loading it into Spark. You'll want to start with maybe 1 GB for each partition, so if the uncompressed file is 100GB, maybe start with 100 partitions. Even if the entire dataset is in one file (which might get read into just 1 or 2 partitions initially with Spark), you can use the repartition(numPartitions) transformation to make 100 partitions. Then you'll have to make sense of the XML schema. You have a few options to do this. You can take advantage of Scala’s XML functionality provided by the scala.xml package to parse the data. Here is a blog post with some code example for this: http://stevenskelton.ca/real-time-data-mining-spark/ Or try sc.wholeTextFiles(). It reads the entire file into a string record. You'll want to make sure that you have enough memory to read the single string into memory. This Cloudera blog post about half-way down has some Regex examples of how to use Scala to parse an XML file into a collection of tuples: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/ You can also search for XMLInputFormat on Google. There are some implementations that allow you to specify the tag to split on, e.g.: https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java Good luck! Sameer F. Client Services @ Databricks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-BZ2-XML-file-in-Spark-tp16954p16960.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
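A minimal hedged sketch of the repartition approach described above; the HDFS path and partition count are placeholders, and reading the .bz2 directly assumes a Hadoop version with splittable bzip2 support as noted.

val osm = sc.textFile("hdfs:///data/planet-latest.osm.bz2")
val spread = osm.repartition(100)  // spread a file that arrived in only a few partitions
println(spread.partitions.length)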
Re: spark ui redirecting to port 8100
Hi Sadhan, Which port are you specifically trying to redirect? The driver program has a web UI, typically on port 4040... and the Spark Standalone Cluster Master has a web UI exposed on port 8080 (7077 is the port applications connect to, not the UI). Which setting did you update in which file to make this change? And finally, which version of Spark are you on? Sameer F. Client Services @ Databricks On Tue, Oct 21, 2014 at 3:29 PM, sadhan sadhan.s...@gmail.com wrote: I set the Spark port to a different one and the connection seems successful, but I get a 302 redirect to /proxy on port 8100. Nothing is listening on that port either. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Streaming - How to write RDD's in same directory ?
Hello, Spark 1.1.0, Hadoop 2.4.1 I have written a Spark streaming application. And I am getting FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath). Here in brief is what I am trying to do. My application is creating a text file stream using the Java streaming context. The input file is on HDFS. JavaDStream<String> textStream = ssc.textFileStream(InputFile); Then it is comparing each line of the input stream with some data and filtering it. The filtered data I am storing in a JavaDStream<String>. JavaDStream<String> suspectedStream = textStream.flatMap(new FlatMapFunction<String, String>(){ @Override public Iterable<String> call(String line) throws Exception { List<String> filteredList = new ArrayList<String>(); // doing filter job return filteredList; } }); And this filteredList I am storing in HDFS as: suspectedStream.foreach(new Function<JavaRDD<String>, Void>(){ @Override public Void call(JavaRDD<String> rdd) throws Exception { rdd.saveAsTextFile(outputFolderPath); return null; }}); But with this I am receiving org.apache.hadoop.mapred.FileAlreadyExistsException. I tried appending a random number to outputFolderPath and it's working. But my requirement is to collect all output in one directory. Can you please suggest if there is any way to get rid of this exception ? Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
Hi, Can you post what the error looks like? Sameer F. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943p16963.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Hi, this sounds like a bug which has been fixed in the current master. What version of Spark are you using? Would it be possible to update to the current master? If not, it would be helpful to know some more of the problem dimensions (num examples, num features, feature types, label type). Thanks, Joseph On Tue, Oct 21, 2014 at 2:42 AM, lokeshkumar lok...@dataken.net wrote: Hi All, I am trying to run the spark example JavaDecisionTree code using some external data set. It works for certain dataset only with specific maxBins and maxDepth settings. Even for a working dataset if I add a new data item I get a ArrayIndexOutOfBounds Exception, I get the same exception for the first case as well (changing maxBins and maxDepth). I am not sure what is wrong here, can anyone please explain this. Exception stacktrace: 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage 7.0 (TID 13) java.lang.ArrayIndexOutOfBoundsException: 6301 at org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) at org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) at org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0 (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301 org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) 
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
Re: Spark Streaming - How to write RDD's in same directory ?
Hi Shailesh, Spark just leverages the Hadoop File Output Format to write out the RDD you are saving. This is really a Hadoop OutputFormat limitation which requires the directory it is writing into to not exist. The idea is that a Hadoop job should not be able to overwrite the results from a previous job, so it enforces that the dir should not exist. Easiest way to get around this may be to just write the results from each Spark app to a newly named directory, then on an interval run a simple script to merge data from multiple HDFS directories into one directory. This HDFS command will let you do something like a directory merge: hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal - /newfolderpath/file See this StackOverflow discussion for a way to do it using Pig and Bash scripting also: https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder Sameer F. Client Services @ Databricks On Tue, Oct 21, 2014 at 3:51 PM, Shailesh Birari sbir...@wynyardgroup.com wrote: Hello, Spark 1.1.0, Hadoop 2.4.1 I have written a Spark streaming application. And I am getting FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath). Here in brief is what I am trying to do. My application is creating a text file stream using the Java streaming context. The input file is on HDFS. JavaDStream<String> textStream = ssc.textFileStream(InputFile); Then it is comparing each line of the input stream with some data and filtering it. The filtered data I am storing in a JavaDStream<String>. JavaDStream<String> suspectedStream = textStream.flatMap(new FlatMapFunction<String, String>(){ @Override public Iterable<String> call(String line) throws Exception { List<String> filteredList = new ArrayList<String>(); // doing filter job return filteredList; } }); And this filteredList I am storing in HDFS as: suspectedStream.foreach(new Function<JavaRDD<String>, Void>(){ @Override public Void call(JavaRDD<String> rdd) throws Exception { rdd.saveAsTextFile(outputFolderPath); return null; }}); But with this I am receiving org.apache.hadoop.mapred.FileAlreadyExistsException. I tried appending a random number to outputFolderPath and it's working. But my requirement is to collect all output in one directory. Can you please suggest if there is any way to get rid of this exception ? Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
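A hedged, self-contained Scala sketch of the "new directory per batch" idea (using the Scala DStream API rather than the Java API used above); paths and batch interval are placeholders. foreachRDD's two-argument form provides the batch time, which gives each interval its own output directory.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.textFileStream("hdfs:///input")  // stands in for the filtered stream above
lines.foreachRDD { (rdd, time) =>
  // one fresh directory per batch avoids FileAlreadyExistsException
  rdd.saveAsTextFile("hdfs:///output/batch-" + time.milliseconds)
}
ssc.start()
ssc.awaitTermination()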
Using the DataStax Cassandra Connector from PySpark
Hi there, I'm using Spark 1.1.0 and experimenting with trying to use the DataStax Cassandra Connector (https://github.com/datastax/spark-cassandra-connector) from within PySpark. As a baby step, I'm simply trying to validate that I have access to classes that I'd need via Py4J. Sample python program: from py4j.java_gateway import java_import from pyspark.conf import SparkConf from pyspark import SparkContext conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1") sc = SparkContext(appName="Spark + Cassandra Example", conf=conf) java_import(sc._gateway.jvm, "com.datastax.spark.connector.*") print sc._jvm.CassandraRow() CassandraRow corresponds to https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala which is included in the JAR I submit. Feel free to download the JAR here https://dl.dropboxusercontent.com/u/4385786/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar I'm currently running this Python example with: spark-submit --driver-class-path=/path/to/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar --verbose src/python/cassandara_example.py But continually get the following error indicating that the classes aren't in fact on the classpath of the GatewayServer: Traceback (most recent call last): File "/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py", line 37, in <module> main() File "/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py", line 25, in main print sc._jvm.CassandraRow() File "/Users/mikesukmanowsky/.opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__ py4j.protocol.Py4JError: Trying to call a package. The correct response from the GatewayServer should be: In [22]: gateway.jvm.CassandraRow() Out[22]: <JavaObject id=o0> Also tried using --jars option instead and that doesn't seem to work either. Is there something I'm missing as to why the classes aren't available? -- Mike Sukmanowsky Aspiring Digital Carpenter *p*: +1 (416) 953-4248 *e*: mike.sukmanow...@gmail.com facebook http://facebook.com/mike.sukmanowsky | twitter http://twitter.com/msukmanowsky | LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github https://github.com/msukmanowsky
Re: Primitive arrays in Spark
It seems that ++ does the right thing on arrays of longs, and gives you another one: scala> val a = Array[Long](1,2,3) a: Array[Long] = Array(1, 2, 3) scala> val b = Array[Long](1,2,3) b: Array[Long] = Array(1, 2, 3) scala> a ++ b res0: Array[Long] = Array(1, 2, 3, 1, 2, 3) scala> res0.getClass res1: Class[_ <: Array[Long]] = class [J The problem might be that lots of intermediate space is allocated as you merge values two by two. In particular, if a key has N arrays mapping to it, your code will allocate O(N^2) space because it builds first an array of size 1, then 2, then 3, etc. You can make this faster by using aggregateByKey instead, and using an intermediate data structure other than an Array to do the merging (ideally you'd find a growable ArrayBuffer-like class specialized for Longs, but you can also just try ArrayBuffer). Matei On Oct 21, 2014, at 1:08 PM, Akshat Aranya aara...@gmail.com wrote: This is as much of a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse like this: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b) If I start with an Array of primitive longs in rdd1, will rdd2 also have Arrays of primitive longs? I suspect, based on my memory usage, that this is not the case. Also, would it be more efficient to do this: val rdd1: RDD[(Long, ArrayBuffer[Long])] and then val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b).map(_.toArray) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
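A hedged sketch of the aggregateByKey suggestion, using ArrayBuffer[Long] as the growable intermediate and converting back to Array[Long] only once per key; the sample data is made up and sc is the shell's SparkContext.

import scala.collection.mutable.ArrayBuffer

val rdd1 = sc.parallelize(Seq((1L, Array(1L, 2L)), (1L, Array(3L)), (2L, Array(4L))))
val rdd2 = rdd1
  .aggregateByKey(ArrayBuffer.empty[Long])(
    (buf, arr) => { buf ++= arr; buf },  // fold one array into the per-key buffer
    (b1, b2) => { b1 ++= b2; b1 })       // merge buffers coming from different partitions
  .mapValues(_.toArray)
rdd2.collect().foreach { case (k, v) => println(k + " -> " + v.mkString(",")) }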
Re: Spark Streaming - How to write RDD's in same directory ?
Thanks Sameer for quick reply. I will try to implement it. Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962p16970.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql not able to find classes with --jars option
It was mainly because Spark was setting the jar classes in a thread-local context classloader. The quick fix was to make our SerDe use the context classloader first. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-not-able-to-find-classes-with-jars-option-tp16839p16972.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Strategies for reading large numbers of files
Thanks to folks here for the suggestions. I ended up settling on what seems to be a simple and scalable approach. I am no longer using sparkContext.textFiles with wildcards (it is too slow when working with a large number of files). Instead, I have implemented directory traversal as a Spark job, which enables it to parallelize across the cluster. First, a couple of functions. One to traverse directories, and another to get the lines in a file: def list_file_names(path: String): Seq[String] = { val fs = FileSystem.get(new java.net.URI(path), hadoopConfiguration) def f(path: Path): Seq[String] = { Option(fs.listStatus(path)).getOrElse(Array[FileStatus]()). flatMap { case fileStatus if fileStatus.isDir ⇒ f(fileStatus.getPath) case fileStatus ⇒ Seq(fileStatus.getPath.toString) } } f(new Path(path)) } def read_log_file(path: String): Seq[String] = { val fs = FileSystem.get(new java.net.URI(path), hadoopConfiguration) val file = fs.open(new Path(path)) val source = Source.fromInputStream(file) source.getLines.toList } Next, I generate a list of root paths to scan: val paths = for { record_type ← record_types year ← years month ← months day ← days hour ← hours } yield { s"s3n://s3-bucket-name/$record_type/$year/$month/$day/$hour/" } (In this case, I generate one path per hour per record type.) Finally, using Spark, I can build an RDD with the contents of every file in the path list: val rdd: RDD[String] = sparkContext. parallelize(paths, paths.size). flatMap(list_file_names). flatMap(read_log_file) I am posting this info here with the hope that it will be useful to somebody in the future. L On Tue, Oct 7, 2014 at 12:58 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi Landon I had a problem very similar to yours, where we have to process around 5 million relatively small files on NFS. After trying various options, we did something similar to what Matei suggested. 1) take the original path and find the subdirectories under that path and then parallelize the resulting list. you can configure the depth you want to go down to before sending the paths across the cluster. def getFileList(srcDir: File, depth: Int): List[File] = { var list: ListBuffer[File] = new ListBuffer[File]() if (srcDir.isDirectory()) { srcDir.listFiles() .foreach((file: File) => if (file.isFile()) { list += (file) } else { if (depth > 0) { list ++= getFileList(file, (depth - 1)) } else if (depth < 0) { list ++= getFileList(file, (depth)) } else { list += file } }) } else { list += srcDir } list.toList } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p15835.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- *Landon Kuhn*, *Software Architect*, Janrain, Inc. http://bit.ly/cKKudR E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025 Follow Janrain: Facebook http://bit.ly/9CGHdf | Twitter http://bit.ly/9umxlK | YouTube http://bit.ly/N0OiBT | LinkedIn http://bit.ly/a7WZMC | Blog http://bit.ly/OI2uOR Follow Me: LinkedIn http://www.linkedin.com/in/landonkuhn - *Acquire, understand, and engage your users. Watch our video http://bit.ly/janrain-overview or sign up for a live demo http://bit.ly/janraindemo to see what it's all about.*
Re: spark sql: sqlContext.jsonFile date type detection and performance
Yes, I am unable to get jsonFile() to detect the date type automatically from JSON data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16974.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Streaming Applications
Hi, I have been trying to find a fairly complex application that makes use of the Spark Streaming framework. I checked public GitHub repos, but the examples I found were too simple, comprising only simple operations like counters and sums. On the Spark Summit website I could find very interesting projects; however, no source code was available. Where can I find non-trivial Spark Streaming application code? Is it that difficult? Thanks.
spark 1.1.0 RDD and Calliope 1.1.0-CTP-U2-H2
Hi, I am using the latest calliope library from tuplejump.com to create an RDD for a Cassandra table. I am on a 3-node Spark 1.1.0 cluster with YARN. My Cassandra table is defined as below and I have about 2000 rows of data inserted. CREATE TABLE top_shows ( program_id varchar, view_minute timestamp, view_count counter, PRIMARY KEY (view_minute, program_id) //note that view_minute is the partition key ); Here are the simple steps I ran from spark-shell on the master node: spark-shell --master yarn-client --jars rna/rna-spark-streaming-assembly-1.0-SNAPSHOT.jar --driver-memory 512m --executor-memory 512m --num-executors 3 --executor-cores 1 // Import the necessary import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import com.tuplejump.calliope.utils.RichByteBuffer._ import com.tuplejump.calliope.Implicits._ import com.tuplejump.calliope.CasBuilder import com.tuplejump.calliope.Types.{CQLRowKeyMap, CQLRowMap} // Define my class and the implicit cast case class ProgramViewCount(viewMinute: Long, program: String, viewCount: Long) implicit def keyValtoProgramViewCount(key: CQLRowKeyMap, values: CQLRowMap): ProgramViewCount = ProgramViewCount(key.get("view_minute").get.getLong, key.get("program_id").toString, values.get("view_count").get.getLong) // Use the cql3 interface to read from the table with a WHERE predicate. val cas = CasBuilder.cql3.withColumnFamily("streaming_qa", "top_shows").onHost("23.22.120.96") .where("view_minute = 141386178") val allPrograms = sc.cql3Cassandra[ProgramViewCount](cas) // Lazy evaluation till this point val rowCount = allPrograms.count I hit the following exception. It seems that it does not like my WHERE clause. If I do not have the WHERE clause, it works fine. But with the WHERE clause, no matter whether the predicate is on the partition key or not, it will fail with the following exception. Can anyone else using the calliope package share some light? Thanks a lot. Tian scala> val rowCount = allPrograms.count 14/10/21 23:26:07 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, ip-10-187-51-136.ec2.internal): java.lang.RuntimeException: com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665) com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301) com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167) com.tuplejump.calliope.cql3.Cql3CassandraRDD$$anon$1.<init>(Cql3CassandraRDD.scala:75) com.tuplejump.calliope.cql3.Cql3CassandraRDD.compute(Cql3CassandraRDD.scala:64) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-RDD-and-Calliope-1-1-0-CTP-U2-H2-tp16975.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Cassandra connector issue
Is this because I am calling a transformation function on an RDD from inside another transformation function? Is it not allowed? Thanks Ankur On Oct 21, 2014 1:59 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi Gerard, this is the code that may be helpful. public class ReferenceDataJoin implements Serializable { private static final long serialVersionUID = 1039084794799135747L; JavaPairRDD<String, Employee> rdd; CassandraJavaRDD<ReferenceData> referenceTable; public PostalReferenceDataJoin(List<Employee> employees) { JavaSparkContext sc = SparkContextFactory.getSparkContextFactory().getSparkContext(); this.rdd = sc.parallelizePairs(employees); this.referenceTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class); } public JavaPairRDD<String, Employee> execute() { JavaPairRDD<String, Employee> joinedRdd = rdd .mapValues(new Function<Employee, Employee>() { private static final long serialVersionUID = -226016490083377260L; @Override public Employee call(Employee employee) throws Exception { ReferenceData data = null; if (employee.getDepartment() != null) { data = referenceTable.where("dept=?", employee.getDepartment()).first(); System.out.println(employee.getDepartment() + " " + data); } if (data != null) { //setters on employee } return employee; } }); return joinedRdd; } } Thanks Ankur On Tue, Oct 21, 2014 at 11:11 AM, Gerard Maas gerard.m...@gmail.com wrote: Looks like that code does not correspond to the problem you're facing. I doubt it would even compile. Could you post the actual code? -kr, Gerard On Oct 21, 2014 7:27 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, I am creating a cassandra java rdd and transforming it using the where clause. It works fine when I run it outside the mapValues, but when I put the code in mapValues I get an error while creating the transformation. Below is my sample code: CassandraJavaRDD<ReferenceData> cassandraRefTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class); JavaPairRDD<String, Employee> joinedRdd = rdd.mapValues(new Function<IPLocation, IPLocation>() { public Employee call(Employee employee) throws Exception { ReferenceData data = null; if(employee.getDepartment() != null) { data = referenceTable.where("postal_plus=?", location.getPostalPlus()).first(); System.out.println(data.toCSV()); } if(data != null) { //call setters on employee } return employee; } } I get this error: java.lang.NullPointerException at org.apache.spark.rdd.RDD.<init>(RDD.scala:125) at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:47) at com.datastax.spark.connector.rdd.CassandraRDD.copy(CassandraRDD.scala:70) at com.datastax.spark.connector.rdd.CassandraRDD.where(CassandraRDD.scala:77) at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(CassandraJavaRDD.java:54) Thanks for help!! Regards Ankur
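For what it's worth, Spark generally does not allow using one RDD (here the Cassandra table RDD) inside a transformation of another RDD, which is consistent with the NullPointerException above. A hedged, self-contained Scala sketch of one common workaround: collect the (small) reference data to the driver, broadcast it, and look it up inside mapValues. Plain tuples stand in for the Employee/ReferenceData classes from the mail.

val reference = sc.parallelize(Seq(("eng", "ref-eng"), ("hr", "ref-hr")))  // dept -> reference data
val employees = sc.parallelize(Seq(("e1", "eng"), ("e2", "hr")))           // employee id -> dept
val refBc = sc.broadcast(reference.collectAsMap())
val joined = employees.mapValues(dept => (dept, refBc.value.getOrElse(dept, "missing")))
joined.collect().foreach(println)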
Re: Spark SQL: sqlContext.jsonFile date type detection and performance
Add one more thing about question 1. Once you get the SchemaRDD from jsonFile/jsonRDD, you can use CAST(columnName as DATE) in your query to cast the column type from StringType to DateType (the string format should be yyyy-[m]m-[d]d and you need to use hiveContext). Here is the code snippet that may help. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val schemaRDD = hiveContext.jsonFile(...) schemaRDD.registerTempTable("jsonTable") hiveContext.sql("SELECT CAST(columnName as DATE) FROM jsonTable") Thanks, Yin On Tue, Oct 21, 2014 at 8:00 PM, Yin Huai huaiyin@gmail.com wrote: Hello Tridib, I just saw this one. 1. Right now, jsonFile and jsonRDD do not detect date type. Right now, IntegerType, LongType, DoubleType, DecimalType, StringType, BooleanType, StructType and ArrayType will be automatically detected. 2. The process of inferring the schema will pass over the entire dataset once to determine the schema. So, you will see a job is launched. Applying a specific schema to a dataset does not have this cost. 3. It is hard to comment on it without seeing your implementation. For our built-in JSON support, jsonFile and jsonRDD provide a very convenient way to work with JSON datasets with SQL. You do not need to define the schema in advance and Spark SQL will automatically create the SchemaRDD for your dataset. You can start to query it with SQL by simply registering the returned SchemaRDD as a temp table. Regarding the implementation, we use a high performance JSON lib (Jackson, https://github.com/FasterXML/jackson) to parse JSON records. Thanks, Yin On Mon, Oct 20, 2014 at 10:56 PM, tridib tridib.sama...@live.com wrote: Hi Spark SQL team, I am trying to explore automatic schema detection for JSON documents. I have a few questions: 1. What should be the date format to detect the fields as date type? 2. Is automatic schema inference slower than applying a specific schema? 3. At this moment I am parsing the JSON myself using a map function and creating a schema RDD from the parsed JavaRDD. Is there any performance impact in not using the inbuilt jsonFile()? Thanks Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark - HiveContext - Unstructured Json
You can resort to SQLContext.jsonFile(path: String, samplingRate: Double) and set samplingRate to 1.0, so that all the columns can be inferred. You can also use SQLContext.applySchema to specify your own schema (which is a StructType). On 10/22/14 5:56 AM, Harivardan Jayaraman wrote: Hi, I have unstructured JSON as my input which may have extra columns from row to row. I want to store these JSON rows using HiveContext so that they can be accessed from the JDBC Thrift Server. I notice there are primarily only two methods available on the SchemaRDD for data - saveAsTable and insertInto. One defines the schema while the other can be used to insert into the table, but there is no way to alter the table and add columns to it. How do I do this? One option that I thought of is to write native CREATE TABLE... and ALTER TABLE.. statements, but that just does not seem feasible because at every step I will need to query Hive to determine what the current schema is and make a decision whether I should add columns to it or not. Any thoughts? Has anyone been able to do this?
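A small hedged sketch of the samplingRate variant mentioned above; the path is a placeholder and sqlContext is a plain SQLContext built from sc. With samplingRate = 1.0 every record is scanned while inferring the schema, so columns that only appear in some rows are still picked up.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val events = sqlContext.jsonFile("hdfs:///data/events.json", 1.0)  // scan all records for schema inference
events.printSchema()
events.registerTempTable("events")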
Re: Asynchronous Broadcast from driver to workers, is it possible?
Looks like the only way is to implement that feature. There is no way of hacking it into working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: com.esotericsoftware.kryo.KryoException: Buffer overflow.
You ran out of Kryo buffer. Are you using Spark 1.1 (which supports buffer resizing) or Spark 1.0 (which has a fixed-size buffer)? On Oct 21, 2014 5:30 PM, nitinkak001 nitinkak...@gmail.com wrote: I am running a simple rdd filter command. What does it mean? Here is the full stack trace (and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420) at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) *Here is the code of the main function:* String comparisonFieldIndexes = "16,18"; String segmentFieldIndexes = "14,15"; String comparisonFieldWeights = "50, 50"; String delimiter = "" + '\001'; PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70, comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes, delimiter); JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER)); parOnCol.printRDD(filtered_rdd); *Here is the FilterEmptyFields class:* public class FilterEmptyFields implements Function<String, Boolean> { final int[] nonEmptyFields; final String DELIMITER; public FilterEmptyFields(int[] nonEmptyFields, String delimiter){ this.nonEmptyFields = nonEmptyFields; this.DELIMITER = delimiter; } @Override public Boolean call(String s){ String[] fields = s.split(DELIMITER); for(int i = 0; i < nonEmptyFields.length; i++){ if(fields[nonEmptyFields[i]] == null || fields[nonEmptyFields[i]].isEmpty()){ return false; } } return true; } } Any suggestions guys? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
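A hedged sketch of raising the Kryo buffer rather than relying on the defaults. As far as I recall, on Spark 1.1 the relevant settings are spark.kryoserializer.buffer.mb (initial per-task buffer) and spark.kryoserializer.buffer.max.mb (resize ceiling), while Spark 1.0 only has the fixed-size setting; the values below are arbitrary examples.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-buffer-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "8")        // initial buffer size in MB
  .set("spark.kryoserializer.buffer.max.mb", "128")  // upper bound for resizing (Spark 1.1)
val sc = new SparkContext(conf)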
Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Hi Joseph, I am using Spark 1.1.0, the latest version; I will try to update to the current master and check. The example I am running is JavaDecisionTree; the dataset is in libsvm format, containing: 1. 45 training instances 2. 5 features 3. I am not sure what the feature type is, but there are no categorical features being passed in the example. 4. Three labels; not sure what the label type is. The example runs fine with 100 as the maxBins value, but when I change this to say 50 or 30 I get the exception. Also, could you please let me know what should be the default value for maxBins (the API says 100 as default, but it did not work in this case)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907p16988.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
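A hedged Scala sketch of calling the tree trainer directly with explicit maxDepth/maxBins on a dataset shaped like the one described above (3 classes, 5 continuous features, libsvm input); the path is a placeholder and this is not the JavaDecisionTree example itself.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/training.libsvm")
val model = DecisionTree.trainClassifier(data, numClasses = 3,
  categoricalFeaturesInfo = Map[Int, Int](),  // empty map: all features treated as continuous
  impurity = "gini", maxDepth = 5, maxBins = 100)
println(model)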
Num-executors and executor-cores overwritten by defaults
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can no longer set the number of executors or executor-cores. No matter what values I pass on the command line to Spark, they are overwritten by the defaults. Does anyone have any idea what could have happened here? Running on Spark 1.0.2 before, I had no trouble. Also, I am able to launch the spark shell without these parameters being overwritten.
spark sql query optimization and decision tree building
Hi all, I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns. Here each line is a feature and each column is a dimension. I have converted the txt files into JSON format and am able to run SQL queries on the JSON files using Spark. Now I am trying to build a k-dimensional decision tree (kd tree) with this large data. My steps: 1) calculate the variance of each column, pick the column with maximum variance and make it the key of the first node, and the mean of the column the value of the node. 2) based on the first node value, split the data into 2 parts and repeat the process until you reach a stopping point. My sample code: import sqlContext._ val people = sqlContext.jsonFile("siftoutput/") people.printSchema() people.registerTempTable("people") val output = sqlContext.sql("SELECT * From people") My Questions: 1) How to save the result values of a query into a list? 2) How to calculate the variance of a column? Is there any efficient way? 3) I will be running multiple queries on the same data. Does Spark have any way to optimize this? 4) How to save the output as key value pairs in a text file? 5) Is there any way I can build a decision kd tree using the machine learning libraries of Spark? please help Thanks, Sanath
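A hedged sketch for question 2, assuming each of the 128 columns can be read from the query result rows as doubles (row.getDouble(i) is an assumption about the inferred column types); MLlib's Statistics.colStats computes per-column mean and variance in a single pass over the data.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = output.map(row => Vectors.dense((0 until 128).map(row.getDouble).toArray))
val summary = Statistics.colStats(vectors)
val variances = summary.variance.toArray              // one variance per column
val maxVarCol = variances.zipWithIndex.maxBy(_._1)._2 // index of the column with the largest variance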
Re: create a Row Matrix
Thanks for the quick response. However, I still only get error messages. I am able to load a .txt file with entries in it and use it in Spark, but I am not able to create a simple matrix, for instance a 2x2 row matrix [1 2 3 4]. I tried variations such as val RowMatrix = Matrix(2,2,array(1,3,2,4)) but it doesn't work. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913p16993.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
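A hedged sketch of building that small example and running SVD on it, reading the entries row by row so the 2x2 matrix becomes the rows (1, 2) and (3, 4); a RowMatrix is constructed from an RDD of MLlib vectors, one vector per row, and sc is the shell's SparkContext.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)  // keep k = 2 singular values
println(svd.s)  // singular values
println(svd.V)  // right singular vectors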