Re: org/I0Itec/zkclient/serialize/ZkSerializer ClassNotFound
You can add this jar http://central.maven.org/maven2/com/101tec/zkclient/0.3/zkclient-0.3.jar to the classpath to get rid of this. If you are hitting further exceptions like ClassNotFound for metrics* etc., then make sure you have all these jars in the classpath: SPARK_CLASSPATH=SPARK_CLASSPATH:/root/akhld/spark/lib/spark-streaming-kafka_2.10-1.1.0.jar:/root/akhld/spark/lib/kafka_2.10-0.8.0.jar:/root/akhld/spark/lib/zkclient-0.3.jar:/root/akhld/spark/lib/metrics-core-2.2.0.jar Thanks Best Regards On Tue, Oct 21, 2014 at 10:45 AM, skane sk...@websense.com wrote: I'm having the same problem with Spark 1.0.0. I got the JavaKafkaWordCount.java example working on my workstation running Spark locally after doing a build, but when I tried to get the example running on YARN, I got the same error. I used the uber jar that was created during the build process for the examples, and I confirmed that org/I0Itec/zkclient/serialize/ZkSerializer is in the uber jar. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-I0Itec-zkclient-serialize-ZkSerializer-ClassNotFound-tp15919p16897.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
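For completeness, the same dependencies can also be attached per application instead of via SPARK_CLASSPATH; a sketch with spark-submit's --jars flag (the main class and application jar below are placeholders, the library paths are the ones above):

./bin/spark-submit --class com.example.KafkaStreamingApp \
  --jars /root/akhld/spark/lib/spark-streaming-kafka_2.10-1.1.0.jar,/root/akhld/spark/lib/kafka_2.10-0.8.0.jar,/root/akhld/spark/lib/zkclient-0.3.jar,/root/akhld/spark/lib/metrics-core-2.2.0.jar \
  your-app.jar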
Re: Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode
What about start-all.sh or start-slaves.sh? Thanks Best Regards On Tue, Oct 21, 2014 at 10:25 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I'm working on a cluster where I need to start the workers separately and connect them to a master. I'm following the instructions here and using branch-1.1 http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually and I can start the master using ./sbin/start-master.sh When I try to start the slave/worker using ./sbin/start-slave.sh it doesn't work. The logs say that it needs the master. When I provide ./sbin/start-slave.sh spark://master-ip:7077 it still doesn't work. I can start the worker using the following command (as described in the documentation). ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT Was wondering why start-slave.sh is not working? Thanks -Soumya
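For reference, a sketch of the two usual ways to bring workers up in standalone mode (if I remember the 1.1-era scripts correctly, start-slave.sh expects a worker number before the master URL, which would explain why passing only the URL fails; host names are illustrative):

# on each worker machine
./sbin/start-slave.sh 1 spark://master-ip:7077
# or, from the master, after listing worker hostnames in conf/slaves
./sbin/start-slaves.sh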
Re: default parallelism bug?
Hi, what do you mean by pretty small? How big is your file? Regards, Olivier. 2014-10-21 6:01 GMT+02:00 Kevin Jung itsjb.j...@samsung.com: I use Spark 1.1.0 and set these options in spark-defaults.conf spark.scheduler.mode FAIR spark.cores.max 48 spark.default.parallelism 72 Thanks, Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/default-parallelism-bug-tp16787p16894.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Convert Iterable to RDD
I don't think this is provided out of the box, but you can use toSeq on your Iterable and if the Iterable is lazy, it should stay that way for the Seq. And then you can use sc.parallelize(myIterable.toSeq) so you'll have your RDD. For the Iterable[Iterable[T]] you can flatten it and then create your RDD from the corresponding Iterable. Regards, Olivier. 2014-10-21 5:07 GMT+02:00 Dai, Kevin yun...@ebay.com: In addition, how to convert Iterable[Iterable[T]] to RDD[T]? Thanks, Kevin. From: Dai, Kevin [mailto:yun...@ebay.com] Sent: October 21, 2014 10:58 To: user@spark.apache.org Subject: Convert Iterable to RDD Hi, All Is there any way to convert an Iterable to an RDD? Thanks, Kevin.
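A minimal sketch of both conversions (assuming the data is small enough to hand to parallelize on the driver; the values are illustrative):

import org.apache.spark.rdd.RDD
val it: Iterable[Int] = Seq(1, 2, 3)
val rdd: RDD[Int] = sc.parallelize(it.toSeq)

val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3))
val flat: RDD[Int] = sc.parallelize(nested.flatten.toSeq)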
Re: RDD to Multiple Tables SparkSQL
If you already know your keys the best way would be to extract one RDD per key (it would not bring the content back to the master and you can take advantage of the caching features) and then execute a registerTempTable per key. But I'm guessing you don't know the keys in advance, and in that case I think it becomes very confusing to put everything in different tables. First of all, how would you query it afterwards? Regards, Olivier. 2014-10-20 13:02 GMT+02:00 critikaled isasmani@gmail.com: Hi I have an RDD which I want to register as multiple tables based on key val context = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(context) import sqlContext.createSchemaRDD case class KV(key: String, id: String, value: String) val logsRDD = context.textFile("logs", 10).map { line => val Array(key, id, value) = line split ' '; (key, id, value) }.registerTempTable("KVS") I want to store the above information into multiple tables based on key without bringing the entire data to the master. Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
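A rough sketch of the per-key approach Olivier describes (assuming the set of keys is known up front; the key values and table names are purely illustrative, and the implicit createSchemaRDD import from the original snippet is what lets registerTempTable be called on the filtered RDDs):

case class KV(key: String, id: String, value: String)
val kvs = context.textFile("logs", 10).map { line =>
  val Array(key, id, value) = line.split(' ')
  KV(key, id, value)
}
// one temp table per known key; each filter is a separate lazily-evaluated RDD
val knownKeys = Seq("keyA", "keyB")  // hypothetical
knownKeys.foreach { k =>
  kvs.filter(_.key == k).registerTempTable("KVS_" + k)
}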
Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Hi All, I am trying to run the spark example JavaDecisionTree code using some external data set. It works for certain dataset only with specific maxBins and maxDepth settings. Even for a working dataset if I add a new data item I get a ArrayIndexOutOfBounds Exception, I get the same exception for the first case as well (changing maxBins and maxDepth). I am not sure what is wrong here, can anyone please explain this. Exception stacktrace: 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage 7.0 (TID 13) java.lang.ArrayIndexOutOfBoundsException: 6301 at org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) at org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) at org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0 (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301 org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
Re: What does KryoException: java.lang.NegativeArraySizeException mean?
Thanks, Guillaume. Below is when the exception happens; nothing has spilled to disk yet. And there isn't a join, but a partitionBy and groupBy action. Actually if numPartitions is small, it succeeds, while if it's large, it fails. Partitioning was simply done by: override def getPartition(key: Any): Int = { (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions } From the stage UI, two attempts of task index 99 (TIDs 1730 and 1751) failed on executor gs-server-1000 (PROCESS_LOCAL; launched 2014/10/21 17:29:56 and 17:31:29; durations 1.6 min and 2.6 min; GC time 30 s and 39 s; shuffle read 43.6 MB and 42.7 MB; shuffle spill 0.0 B in memory and 0.0 B on disk), both with the same error: com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException Serialization trace: values (org.apache.spark.sql.catalyst.expressions.SpecificMutableRow) otherElements (org.apache.spark.util.collection.CompactBuffer) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585) com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38) com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:203) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:90) org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:89) org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:625) org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:625) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745)
[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?
Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full stacktrace below): java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer Is this function supposed to be available? Thanks P. --- java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer at org.apache.spark.sql.hive.HiveUdafFunction.init(hiveUdfs.scala:334) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207) at org.apache.spark.sql.execution.Aggregate.org$apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Getting Spark SQL talking to Sql Server
Hi, Is there a simple way to run Spark SQL queries against SQL Server databases? Or are we limited to running SQL and doing sc.parallelize()? Being able to query small amounts of lookup info directly from Spark can save a bunch of annoying ETL, and I'd expect Spark SQL to have some way of doing this. Cheers, Ashic.
Custom s3 endpoint
I have an s3-compatible service and I'd like to have access to it in Spark. From what I have gathered, I need to add s3service.s3-endpoint=my_s3_endpoint to a jets3t.properties file on the classpath. I'm not a Java programmer and I'm not sure where to put it in a hello-world example. I managed to make it work with a local master with this hack: Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME).setProperty("s3service.s3-endpoint", my_s3_endpoint); But this property fails to propagate when I run Spark on a Mesos cluster. Putting a correct jets3t.properties in SPARK_HOME/conf also helps only in local master mode. Can anyone help with this issue? Where should I put my jets3t.properties in a Java project? That would be super-awesome if Spark could pick up the s3 endpoint from env variables like it does with s3 credentials. Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-s3-endpoint-tp16911.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Getting Spark SQL talking to Sql Server
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL server. Currently Spark SQL can't run queries against SQL server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark sql queries against Sql Server databases? Or are we limited to running sql and doing sc.Parallelize()? Being able to query small amounts of lookup info directly from spark can save a bunch of annoying etl, and I'd expect Spark Sql to have some way of doing this. Cheers, Ashic.
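For anyone looking for the shape of that, a minimal JdbcRDD sketch (the connection string, table and bounds are illustrative; the two ?s in the query are mandatory and get filled with each partition's bounds, and the Microsoft JDBC driver jar has to be on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=me;password=secret"  // hypothetical
val lookup = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url),
  "SELECT id, name FROM lookup_table WHERE id >= ? AND id <= ?",
  1, 1000, 2,  // lower bound, upper bound, number of partitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))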
create a Row Matrix
Hi, I am VERY new to spark and mllib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question but could somebody please give me an example - how to create a RowMatrix in scala with the following entries: [1 2 3 4]? I would like to apply an SVD on it. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
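Not an authoritative answer, but here is a minimal sketch assuming the four numbers are meant as a single row (for a multi-row matrix, just parallelize one Vector per row):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0, 4.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(mat.numCols().toInt, computeU = true)
// svd.s holds the singular values, svd.U and svd.V the factors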
Re: spark sql: join sql fails after sqlCtx.cacheTable()
val sqlContext = new org.apache.spark.sql.SQLContext(sc) val personPath = "/hdd/spark/person.json" val person = sqlContext.jsonFile(personPath) person.printSchema() person.registerTempTable("person") val addressPath = "/hdd/spark/address.json" val address = sqlContext.jsonFile(addressPath) address.printSchema() address.registerTempTable("address") sqlContext.cacheTable("person") sqlContext.cacheTable("address") val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address a where p.id = a.id limit 10").collect.foreach(println) person.json: {"id":1,"name":"Mr. X"} address.json: {"city":"Earth","id":1} -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: why fetch failed
thank you, it works! The akka timeout may be the bottleneck in my system. On Oct 20, 2014, at 17:07, Akhil Das ak...@sigmoidanalytics.com wrote: I used to hit this issue when my data size was too large and the number of partitions was too large (> 1200). I got rid of it by - Reducing the number of partitions - Setting the following while creating the sparkContext: .set("spark.rdd.compress","true") .set("spark.storage.memoryFraction","1") .set("spark.core.connection.ack.wait.timeout","600") .set("spark.akka.frameSize","50") Thanks Best Regards On Sun, Oct 19, 2014 at 6:52 AM, marylucy qaz163wsx_...@hotmail.com wrote: When doing groupby for big data, maybe 500g, some partition tasks succeed, some partition tasks get fetchfailed errors. Spark system retries the previous stage, but always fails. 6 computers: 384g Worker: 40g*7 for one computer Can anyone tell me why fetch failed??? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: why fetch failed
thanks, I need to check whether Spark 1.1.0 contains it. On Oct 21, 2014, at 0:01, DB Tsai dbt...@dbtsai.com wrote: I ran into the same issue when the dataset is very big. Marcelo from Cloudera found that it may be caused by SPARK-2711, so their Spark 1.1 release reverted SPARK-2711, and the issue is gone. See https://issues.apache.org/jira/browse/SPARK-3633 for detail. You can check out Cloudera's version here https://github.com/cloudera/spark/tree/cdh5-1.1.0_5.2.0 PS, I haven't tested it yet, but will test it in the coming couple of days and report back. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Oct 18, 2014 at 6:22 PM, marylucy qaz163wsx_...@hotmail.com wrote: When doing groupby for big data, maybe 500g, some partition tasks succeed, some partition tasks get fetchfailed errors. Spark system retries the previous stage, but always fails. 6 computers: 384g Worker: 40g*7 for one computer Can anyone tell me why fetch failed??? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Getting Spark SQL talking to Sql Server
Thanks. Didn't know about jdbcrdd...should do nicely for now. The foreign data source api looks interesting... Date: Tue, 21 Oct 2014 20:33:03 +0800 From: lian.cs@gmail.com To: as...@live.com; user@spark.apache.org Subject: Re: Getting Spark SQL talking to Sql Server Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL server. Currently Spark SQL can't run queries against SQL server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark sql queries against Sql Server databases? Or are we limited to running sql and doing sc.Parallelize()? Being able to query small amounts of lookup info directly from spark can save a bunch of annoying etl, and I'd expect Spark Sql to have some way of doing this. Cheers, Ashic.
Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
That's true Guillaume. I'm currently aggregating documents considering a week as time range. I will have to make it daily and aggregate the results later. thanks for your hints anyway Arian Pasquali http://about.me/arianpasquali 2014-10-20 13:53 GMT+01:00 Guillaume Pitel guillaume.pi...@exensa.com: Hi, The array size you (or the serializer) tries to allocate is just too big for the JVM. No configuration can help : https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit The only option is to split you problem further by increasing parallelism. Guillaume Hi, I’m using Spark 1.1.0 and I’m having some issues to setup memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration https://spark.apache.org/docs/1.1.0/configuration.html. My server has 30G of memory and this are my current settings. ##this one seams that was deprecated export SPARK_MEM=‘25g’ ## worker memory options seams to be the memory for each worker (by default we have a worker for each core) export SPARK_WORKER_MEMORY=‘5g’ I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but I’m not quite sure how. I have tried some different options like the following, but I still couldn’t make it right: export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' Does anyone has any idea how can I approach this? 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far) 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far) 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566) java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140 Arian -- Guillaume PITEL, Président +33(0)626 222 431 eXenSa S.A.S. http://www.exensa.com/ 41, rue Périer - 92120 Montrouge - FRANCE Tel +33(0)184 163 677 / Fax +33(0)972 283 705
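For readers hitting the same wall, a sketch of what "increasing parallelism" can look like in practice (the RDD name and partition counts are purely illustrative; the point is just to keep any single task's buffer small enough to serialize):

// docs stands in for the pair RDD being aggregated (hypothetical)
val grouped = docs.groupByKey(numPartitions = 2000)  // more, smaller shuffle partitions
val spread = docs.repartition(2000)                  // or repartition before the heavy stage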
Re: Streams: How do RDDs get Aggregated?
Hi Spark! I found out why my RDDs weren't coming through in my spark stream. It turns out onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example def onStartMock(): Unit = { val future = new Thread(new Runnable() { def run() { for (x <- 1 until 10) { val newMem = Runtime.getRuntime.freeMemory() / 12188091; if (newMem != lastMem) { System.out.println("in thread : " + newMem); } lastMem = newMem; store(mockStatus); } } }); future.start() } Hope that helps somebody in the same situation. FYI it's in the docs :) * {{{ * class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) { * def onStart() { * // Setup stuff (start threads, open sockets, etc.) to start receiving data. * // Must start new thread to receive data, as onStart() must be non-blocking. * * // Call store(...) in those threads to store received data into Spark's memory. * * // Call stop(...), restart(...) or reportError(...) on any thread based on how * // different errors needs to be handled. * * // See corresponding method documentation for more details * } * * def onStop() { * // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data. * } * } * }}}
Re: How do you write a JavaRDD into a single file
Collect will store the entire output in a List in memory. This solution is acceptable for Little Data problems, although if the entire problem fits in the memory of a single machine there is less motivation to use Spark. Most problems which benefit from Spark are large enough that even the data assigned to a single partition will not fit into memory. In my special case the output is now in the 0.5 - 4 GB range but in the future might get to 4 times that size - something a single machine could write but not hold at one time. I find that for most problems a file like Part-0001 is not what the next step wants to use - the minute a step is required to further process that file, even to move and rename it, there is little reason not to let the Spark code write what is wanted in the first place. I like the solution of using toLocalIterator and writing my own file.
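For concreteness, a minimal sketch of that toLocalIterator approach (assuming the rows are strings and the destination is a path local to the driver; the path is illustrative):

import java.io.PrintWriter
val out = new PrintWriter("/data/output.txt")  // hypothetical destination
try {
  // toLocalIterator pulls one partition at a time, so the driver never holds the whole dataset
  rdd.toLocalIterator.foreach(line => out.println(line))
} finally {
  out.close()
}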
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Hi Tridib, I changed SQLContext to HiveContext and it started working. These are the steps I used. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val person = sqlContext.jsonFile("json/person.json") person.printSchema() person.registerTempTable("person") val address = sqlContext.jsonFile("json/address.json") address.printSchema() address.registerTempTable("address") sqlContext.cacheTable("person") sqlContext.cacheTable("address") val rs2 = sqlContext.sql("select p.id, p.name, a.city from person p join address a on (p.id = a.id)").collect.foreach(println) Rishi@InfoObjects Pure-play Big Data Consulting On Tue, Oct 21, 2014 at 5:47 AM, tridib tridib.sama...@live.com wrote: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val personPath = "/hdd/spark/person.json" val person = sqlContext.jsonFile(personPath) person.printSchema() person.registerTempTable("person") val addressPath = "/hdd/spark/address.json" val address = sqlContext.jsonFile(addressPath) address.printSchema() address.registerTempTable("address") sqlContext.cacheTable("person") sqlContext.cacheTable("address") val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p, address a where p.id = a.id limit 10").collect.foreach(println) person.json: {"id":1,"name":"Mr. X"} address.json: {"city":"Earth","id":1} -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Hmm... I thought HiveContext will only work if Hive is present. I am curious to know when to use HiveContext and when to use SQLContext. Thanks Regards Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16924.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Cassandra connector issue
Hi, I am creating a Cassandra Java RDD and transforming it using the where clause. It works fine when I run it outside the mapValues, but when I put the code in mapValues I get an error while creating the transformation. Below is my sample code: CassandraJavaRDD<ReferenceData> cassandraRefTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class); JavaPairRDD<String, Employee> joinedRdd = rdd.mapValues(new Function<IPLocation, IPLocation>() { public Employee call(Employee employee) throws Exception { ReferenceData data = null; if (employee.getDepartment() != null) { data = referenceTable.where("postal_plus=?", location.getPostalPlus()).first(); System.out.println(data.toCSV()); } if (data != null) { // call setters on employee } return employee; } } I get this error: java.lang.NullPointerException at org.apache.spark.rdd.RDD.<init>(RDD.scala:125) at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:47) at com.datastax.spark.connector.rdd.CassandraRDD.copy(CassandraRDD.scala:70) at com.datastax.spark.connector.rdd.CassandraRDD.where(CassandraRDD.scala:77) at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(CassandraJavaRDD.java:54) Thanks for help!! Regards Ankur
Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?
No, analytic and window functions do not work yet. On Tue, Oct 21, 2014 at 3:00 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full stacktrace below): java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer Is this function supposed to be available? Thanks P. --- java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer at org.apache.spark.sql.hive.HiveUdafFunction.init(hiveUdfs.scala:334) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233) at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207) at org.apache.spark.sql.execution.Aggregate.org $apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
disk-backing pyspark rdds?
Hi All! I'm getting my feet wet with pySpark for the fairly boring case of doing parameter sweeps for monte carlo runs. Each of my functions runs for a very long time (2h+) and returns numpy arrays on the order of ~100 MB. That is, my spark applications look like def foo(x): np.random.seed(x); eat_2GB_of_ram(); take_2h(); return my_100MB_array followed by sc.parallelize(np.arange(100)).map(foo).saveAsPickleFile("s3n://blah...") The resulting rdds will most likely not fit in memory but for this use case I don't really care. I know I can persist RDDs, but is there any way to by-default disk-back them (something analogous to mmap?) so that they don't create memory pressure in the system at all? With compute taking this long, the added overhead of disk and network IO is quite minimal. Thanks! ...Eric Jonas
stage failure: Task 0 in stage 0.0 failed 4 times
what could cause this type of 'stage failure'? Thanks! This is a simple py spark script to list data in hbase. command line: ./spark-submit --driver-class-path ~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py 14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-***.ec2.internal:35201 (size: 1470.0 B, free: 265.4 MB) 14/10/21 17:53:50 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0 14/10/21 17:53:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[1] at map at PythonHadoopUtil.scala:185) 14/10/21 17:53:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:34050/user/Executor#681287499] with ID 0 14/10/21 17:53:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:53 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-***.ec2.internal:47483/user/Executor#-936252397] with ID 1 14/10/21 17:53:53 INFO BlockManagerMasterActor: Registering block manager ip-2.internal:49236 with 3.1 GB RAM 14/10/21 17:53:54 INFO BlockManagerMasterActor: Registering block manager ip-.ec2.internal:36699 with 3.1 GB RAM 14/10/21 17:53:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-.ec2.internal): java.lang.IllegalStateException: unread block data java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) 14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor ip-.internal: java.lang.IllegalStateException (unread block data) [duplicate 1] 14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor ip-.internal: java.lang.IllegalStateException (unread block data) [duplicate 2] 14/10/21 17:53:54 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, ip-.internal, ANY, 1264 bytes) 14/10/21 17:53:54 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor ip-2.internal: java.lang.IllegalStateException (unread block data) [duplicate 3] 14/10/21 17:53:54 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 14/10/21 17:53:54 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/10/21 17:53:54 INFO TaskSchedulerImpl: Cancelling stage 0 14/10/21 17:53:54 INFO DAGScheduler: Failed to run 
first at SerDeUtil.scala:70 Traceback (most recent call last): File /root/workspace/test/sparkhbase.py, line 17, in module conf=conf2) File /root/spark/python/pyspark/context.py, line 471, in newAPIHadoopRDD jconf, batchSize) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-internal): java.lang.IllegalStateException: unread block data java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
Re: stage failure: Task 0 in stage 0.0 failed 4 times
maybe set up a hbase.jar in the conf? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-Task-0-in-stage-0-0-failed-4-times-tp16928p16929.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL - TreeNodeException for unresolved attributes
Just to follow up, the queries worked against master and I got my whole flow rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with the next release of CDH5 :P -Terry From: Terry Siu terry@smartfocus.com Date: Monday, October 20, 2014 at 12:22 PM To: Michael Armbrust mich...@databricks.com Cc: user@spark.apache.org Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Hi Michael, Thanks again for the reply. Was hoping it was something I was doing wrong in 1.1.0, but I’ll try master. Thanks, -Terry From: Michael Armbrust mich...@databricks.com Date: Monday, October 20, 2014 at 12:11 PM To: Terry Siu terry@smartfocus.com Cc: user@spark.apache.org Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Have you tried this on master? There were several problems with resolution of complex queries that were registered as tables in the 1.1.0 release. On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu terry@smartfocus.com wrote: Hi all, I’m getting a TreeNodeException for unresolved attributes when I do a simple select from a schemaRDD generated by a join in Spark 1.1.0. A little background first. I am using a HiveContext (against Hive 0.12) to grab two tables, join them, and then perform multiple INSERT-SELECT with GROUP BY to write back out to a Hive rollup table that has two partitions. This task is an effort to simulate the unsupported GROUPING SETS functionality in SparkSQL. In my first attempt, I got really close using SchemaRDD.groupBy until I realized that SchemaRDD.insertTo API does not support partitioned tables yet. This prompted my second attempt to pass in SQL to the HiveContext.sql API instead.
Here’s a rundown of the commands I executed on the spark-shell: val hc = new HiveContext(sc) hc.setConf(spark.sql.hive.convertMetastoreParquet, true”) hc.setConf(spark.sql.parquet.compression.codec, snappy”) // For implicit conversions to Expression val sqlContext = new SQLContext(sc) import sqlContext._ val segCusts = hc.hql(“select …”) val segTxns = hc.hql(“select …”) val sc = segCusts.as('sc) val st = segTxns.as(‘st) // Join the segCusts and segTxns tables val rup = sc.join(st, Inner, Some(sc.segcustomerid.attr===st.customerid.attr)) rup.registerAsTable(“rupbrand”) If I do a printSchema on the rup, I get: root |-- segcustomerid: string (nullable = true) |-- sales: double (nullable = false) |-- tx_count: long (nullable = false) |-- storeid: string (nullable = true) |-- transdate: long (nullable = true) |-- transdate_ts: string (nullable = true) |-- transdate_dt: string (nullable = true) |-- unitprice: double (nullable = true) |-- translineitem: string (nullable = true) |-- offerid: string (nullable = true) |-- customerid: string (nullable = true) |-- customerkey: string (nullable = true) |-- sku: string (nullable = true) |-- quantity: double (nullable = true) |-- returnquantity: double (nullable = true) |-- channel: string (nullable = true) |-- unitcost: double (nullable = true) |-- transid: string (nullable = true) |-- productid: string (nullable = true) |-- id: string (nullable = true) |-- campaign_campaigncost: double (nullable = true) |-- campaign_begindate: long (nullable = true) |-- campaign_begindate_ts: string (nullable = true) |-- campaign_begindate_dt: string (nullable = true) |-- campaign_enddate: long (nullable = true) |-- campaign_enddate_ts: string (nullable = true) |-- campaign_enddate_dt: string (nullable = true) |-- campaign_campaigntitle: string (nullable = true) |-- campaign_campaignname: string (nullable = true) |-- campaign_id: string (nullable = true) |-- product_categoryid: string (nullable = true) |-- product_company: string (nullable = true) |-- product_brandname: string (nullable = true) |-- product_vendorid: string (nullable = true) |-- product_color: string (nullable = true) |-- product_brandid: string (nullable = true) |-- product_description: string (nullable = true) |-- product_size: string (nullable = true) |-- product_subcategoryid: string (nullable = true) |-- product_departmentid: string (nullable = true) |-- product_productname: string (nullable = true) |-- product_categoryname: string (nullable = true) |-- product_vendorname: string (nullable = true) |-- product_sku: string (nullable = true) |-- product_subcategoryname: string (nullable = true) |-- product_status: string (nullable = true) |-- product_departmentname: string (nullable = true) |-- product_style: string (nullable = true) |-- product_id: string (nullable = true) |-- customer_lastname: string (nullable =
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Hmm... I thought HiveContext will only worki if Hive is present. I am curious to know when to use HiveContext and when to use SqlContext. http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started TLDR; Always use HiveContext if your application does not have a dependency conflict with Hive jars. :)
How to set hadoop native library path in spark-1.1
Hi all, Can anyone tell me how to set the hadoop native library path in Spark? Right now I am setting it using the SPARK_LIBRARY_PATH environment variable in spark-env.sh. But still no success. I am still seeing this in spark-shell: NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Thanks, Pradeep
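Not sure it is the blessed way, but one thing worth trying is the extra library path settings in spark-defaults.conf (the directory below is illustrative; point it at wherever the Hadoop native libs live on your nodes):

spark.driver.extraLibraryPath   /usr/lib/hadoop/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native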
Re: spark sql: join sql fails after sqlCtx.cacheTable()
Thanks for pointing that out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16933.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Streams: How do RDDs get Aggregated?
Oh - and one other note on this, which appears to be the case: if, in your stream's foreachRDD implementation, you do something stupid (like call rdd.count()), you can get stuck in a situation where your RDD processor blocks infinitely. tweetStream.foreachRDD((rdd, lent) => { tweetStream.repartition(1) numTweetsCollected += 1; // val count = rdd.count() DON'T DO THIS ! And for Twitter-specific stuff, make sure to look at modifying the TwitterInputDStream class so that it implements the stuff from SPARK-2464; otherwise you can hit infinite stream reopening as well. On Tue, Oct 21, 2014 at 11:02 AM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Spark! I found out why my RDDs weren't coming through in my spark stream. It turns out onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example def onStartMock(): Unit = { val future = new Thread(new Runnable() { def run() { for (x <- 1 until 10) { val newMem = Runtime.getRuntime.freeMemory() / 12188091; if (newMem != lastMem) { System.out.println("in thread : " + newMem); } lastMem = newMem; store(mockStatus); } } }); future.start() } Hope that helps somebody in the same situation. FYI it's in the docs :) * {{{ * class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) { * def onStart() { * // Setup stuff (start threads, open sockets, etc.) to start receiving data. * // Must start new thread to receive data, as onStart() must be non-blocking. * * // Call store(...) in those threads to store received data into Spark's memory. * * // Call stop(...), restart(...) or reportError(...) on any thread based on how * // different errors needs to be handled. * * // See corresponding method documentation for more details * } * * def onStop() { * // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data. * } * } * }}} -- jay vyas
How to calculate percentiles with Spark?
Hi, What would be the best way to get percentiles from a Spark RDD? I can see JavaDoubleRDD or MLlib's MultivariateStatisticalSummary https://spark.apache.org/docs/latest/mllib-statistics.html provide the mean() but not percentiles. Thank you! Horace -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentiles-with-Spark-tp16937.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
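One way to get an exact percentile with plain RDD operations is sketched below (it costs a full sort plus a lookup; the input data is illustrative, and for huge datasets sampling first is a common shortcut):

val values = sc.parallelize(Seq(1.0, 5.0, 2.0, 9.0, 3.0))  // hypothetical data
val indexed = values.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val n = indexed.count()
def percentile(p: Double): Double = {
  val rank = math.max(math.ceil(p / 100.0 * n).toLong - 1, 0L)
  indexed.lookup(rank).head
}
val median = percentile(50)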
Spark-Submit Python along with JAR
Hi, I'd like to run my python script using spark-submit together with a JAR file containing Java specifications for a Hadoop file system. How can I do that? It seems I can either provide a JAR file or a Python file to spark-submit. So far I have been running my code in ipython with IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark --jars /usr/local/spark/HadoopFileFormat.jar Best, Tassilo -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Submit-Python-along-with-JAR-tp16938.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
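For what it's worth, spark-submit accepts --jars alongside a Python file; a sketch along the lines of the ipython invocation above (the script name is a placeholder):

/usr/local/spark/bin/spark-submit --jars /usr/local/spark/HadoopFileFormat.jar your_script.py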
spark sql: sqlContext.jsonFile date type detection and perforormance
Any help? or comments? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Primitive arrays in Spark
This is as much of a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse as follows: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b) If I start with an Array of primitive longs in rdd1, will rdd2 also have Arrays of primitive longs? I suspect, based on my memory usage, that this is not the case. Also, would it be more efficient to do this: val rdd1: RDD[(Long, ArrayBuffer[Long])] and then val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b).mapValues(_.toArray)
MLLib libsvm format
Hi All, I have a question regarding the ordering of indices. The documentation says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data: It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based. For example, I have indices ranging from 1 to 1000. Is this OK as a libsvm data file?
1 110:1.0 80:0.5 310:0.00 890:0.5
2 0:0.0 200:0.5 400:1.0 82:0.0
and so on, OR do I need to sort them as:
1 80:0.5 110:1.0 310:0.00 890:0.5
2 0:0.0 82:0.0 200:0.5 400:1.0
Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?
Ok thanks Michael. In general, what's the easy way to figure out what's already implemented? The exception I was getting was not really helpful here? Also, is there a roadmap document somewhere ? Thanks! P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909p16942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
Thanks for the help! Hadoop version: 2.3.0 HBase version: 0.98.1 Use python to read/write data from/to HBase. The only change over the official Spark 1.1.0 is the pom file under examples. Compilation: spark: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package spark/examples: mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests clean package I am wondering how I can deploy this version of Spark to a new ec2 cluster. I tried ./spark-ec2 -k sparkcluster -i ~/sparkcluster.pem -s 1 -v 1.1.0 --hadoop-major-version=2.3.0 --worker-instances=2 -z us-east-1d launch sparktest1 but this version got a type mismatch error when I read HBase data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to calculate percentiles with Spark?
A rather more general question is - assume I have a JavaRDD<K> which is sorted. How can I convert this into a JavaPairRDD<Integer, K> where the Integer is the index 0...N-1? Easy to do on one machine: JavaRDD<K> values = ... // create here JavaPairRDD<Integer, K> positions = values.mapToPair(new PairFunction<K, Integer, K>() { private int index = 0; @Override public Tuple2<Integer, K> call(final K t) throws Exception { return new Tuple2(index++, t); } }); but will this code do the right thing on a cluster? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentiles-with-Spark-tp16937p16945.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
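For what it's worth, a sketch of the same thing with the built-in zipWithIndex, which assigns consistent 0...N-1 indices across partitions on a cluster (shown in Scala for brevity; JavaRDD has an equivalent zipWithIndex method):

val values = sc.parallelize(Seq("a", "b", "c"))                      // hypothetical sorted RDD
val positions = values.zipWithIndex().map { case (v, i) => (i, v) }  // (index, value) pairs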
Re: Class not found
maven cache is laid out differently but it does work on Linux and BSD/mac. Still looks like a hack to me. On Oct 21, 2014, at 1:28 PM, Pat Ferrel p...@occamsmachete.com wrote: Doesn’t this seem like a dangerous, error-prone hack? It will build different bits on different machines. It doesn’t even work on my linux box because the mvn install doesn’t cache the same as on the mac. If Spark is going to be supported on the maven repos shouldn’t it be addressed by different artifacts to support any option that changes the linkage info/class naming? On Oct 21, 2014, at 12:16 PM, Pat Ferrel p...@occamsmachete.com wrote: Not sure if this has been clearly explained here but since I took a day to track it down… Several people have experienced a class not found error on Spark when the class referenced is supposed to be in the Spark jars. One thing that can cause this is if you are building Spark for your cluster environment. The instructions say to do a “mvn package …”. Instead, some of these errors can be fixed using the following procedure: 1) delete ~/.m2/repository/org/apache/spark and your-project 2) build Spark for your version of Hadoop, but do not use “mvn package …” - use “mvn install …”. This will put a copy of the exact bits you need into the maven cache for building your-project against. In my case using hadoop 1.2.1 it was “mvn -Dhadoop.version=1.2.1 -DskipTests clean install”. If you run tests on Spark some failures can safely be ignored, so check before giving up. 3) build your-project with “mvn clean install”
com.esotericsoftware.kryo.KryoException: Buffer overflow.
I am running a simple rdd filter command. What does it mean? Here is the full stack trace (and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420) at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) *Here is the code of the main function:* String comparisonFieldIndexes = "16,18"; String segmentFieldIndexes = "14,15"; String comparisonFieldWeights = "50, 50"; String delimiter = "" + '\001'; PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70, comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes, delimiter); JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER)); parOnCol.printRDD(filtered_rdd); *Here is the FilterEmptyFields class:* public class FilterEmptyFields implements Function<String, Boolean> { final int[] nonEmptyFields; final String DELIMITER; public FilterEmptyFields(int[] nonEmptyFields, String delimiter){ this.nonEmptyFields = nonEmptyFields; this.DELIMITER = delimiter; } @Override public Boolean call(String s){ String[] fields = s.split(DELIMITER); for(int i = 0; i < nonEmptyFields.length; i++){ if(fields[nonEmptyFields[i]] == null || fields[nonEmptyFields[i]].isEmpty()){ return false; } } return true; } } Any suggestions guys? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SchemaRDD.where clause error
Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql("select * from people") people.where('age === 10) <console>:27: error: value === is not a member of Symbol Where did I go wrong? Thanks, Kevin Paul
buffer overflow when running Kmeans
this is the stack trace I got with yarn logs -applicationId really no idea where to dig further. thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98 com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 3, required: 8 Serialization trace: data$mcD$sp (breeze.linalg.DenseVector$mcD$sp) at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477) at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200) at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:142) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
Re: SchemaRDD.where clause error
You need to import sqlCtx._ to get access to the implicit conversion. On Tue, Oct 21, 2014 at 2:40 PM, Kevin Paul kevinpaulap...@gmail.com wrote: Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql("select * from people") people.where('age === 10) <console>:27: error: value === is not a member of Symbol Where did I go wrong? Thanks, Kevin Paul
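A minimal sketch of the suggested fix, assuming sc is the shell's SparkContext and that a "people" table has already been registered as in the original question; importing sqlCtx._ brings the implicit that turns 'age into a column expression into scope.

import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
import sqlCtx._  // implicit conversions, including Symbol -> attribute
val people = sqlCtx.sql("select * from people")
people.where('age === 10).collect().foreach(println)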
Re: buffer overflow when running Kmeans
Just posted below for a similar question. Have you seen this thread ? http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflowsubj=RE+spark+nbsp+kryo+serilizable+nbsp+exception On Tue, Oct 21, 2014 at 2:44 PM, Yang tedd...@gmail.com wrote: this is the stack trace I got with yarn logs -applicationId really no idea where to dig further. thanks! yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98 com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 3, required: 8 Serialization trace: data$mcD$sp (breeze.linalg.DenseVector$mcD$sp) at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477) at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200) at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:142) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
How to read BZ2 XML file in Spark?
Hi, I want to ingest OpenStreetMap. It's 43GB (compressed) XML in BZIP2 format. What's your advice for reading it into an RDD? BTW, the Spark Training at UMD is awesome! I'm having a blast learning Spark. I wish I could go to the MeetUp tonight, but I have kid activities... http://wiki.openstreetmap.org/wiki/Planet.osm http://wiki.openstreetmap.org/wiki/OSM_XML John -- John S. Roberts SigInt Technologies LLC, a Novetta Solutions Company 8830 Stanford Blvd, Suite 306; Columbia, MD 21045 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql: sqlContext.jsonFile date type detection and performance
Are there any specific issues you are facing? Thanks, Yin On Tue, Oct 21, 2014 at 4:00 PM, tridib tridib.sama...@live.com wrote: Any help? or comments? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MLLib libsvm format
Yes. "where the indices are one-based and **in ascending order**." -Xiangrui On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based. For example, I have indices ranging from 1 to 1000; is this OK as a libsvm data file? 1 110:1.0 80:0.5 310:0.0 0 890:0.5 20:0.0 200:0.5 400:1.0 82:0.0 and so on: OR do I need to sort them as: 1 80:0.5 110:1.0 310:0.0 0 20:0.0 82:0.0 200:0.5 400:1.0 890:0.5 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
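For completeness, a small hedged sketch of loading such a file with MLlib once the rows are sorted; the path is a placeholder and sc is the shell's SparkContext. loadLibSVMFile converts the one-based indices to zero-based on load, as the quoted documentation notes.

import org.apache.spark.mllib.util.MLUtils
val examples = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
examples.take(1).foreach(println)  // LabeledPoint(label, sparse feature vector)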
spark ui redirecting to port 8100
I set the Spark port to a different one and the connection seems successful, but I get a 302 redirect to /proxy on port 8100. Nothing is listening on that port either. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: create a Row Matrix
Please check out the example code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala -Xiangrui On Tue, Oct 21, 2014 at 5:34 AM, viola viola.wiersc...@siemens.com wrote: Hi, I am VERY new to spark and mllib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question but could somebody please give me an example - how to create a RowMatrix in scala with the following entries: [1 2 3 4]? I would like to apply an SVD on it. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: MLLib libsvm format
Great, I will sort them. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone -------- Original message -------- From: Xiangrui Meng men...@gmail.com Date: 10/21/2014 3:29 PM (GMT-08:00) To: Sameer Tilak ssti...@live.com Cc: user@spark.apache.org Subject: Re: MLLib libsvm format Yes. "where the indices are one-based and **in ascending order**." -Xiangrui On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 ... where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based. For example, I have indices ranging from 1 to 1000; is this OK as a libsvm data file? 1 110:1.0 80:0.5 310:0.0 0 890:0.5 20:0.0 200:0.5 400:1.0 82:0.0 and so on: OR do I need to sort them as: 1 80:0.5 110:1.0 310:0.0 0 20:0.0 82:0.0 200:0.5 400:1.0 890:0.5
Re: How to read BZ2 XML file in Spark?
Hi John, Glad you're enjoying the Spark training at UMD. Is the 43 GB XML data in a single file or split across multiple BZIP2 files? Is the file in a HDFS cluster or on a single linux machine? If you're using BZIP2 with splittable compression (in HDFS), you'll need at least Hadoop 1.1: https://issues.apache.org/jira/browse/HADOOP-7823 Or if you've got the file on a single linux machine, perhaps consider uncompressing it manually using cmd line tools before loading it into Spark. You'll want to start with maybe 1 GB for each partition, so if the uncompressed file is 100GB, maybe start with 100 partitions. Even if the entire dataset is in one file (which might get read into just 1 or 2 partitions initially with Spark), you can use the repartition(numPartitions) transformation to make 100 partitions. Then you'll have to make sense of the XML schema. You have a few options to do this. You can take advantage of Scala’s XML functionality provided by the scala.xml package to parse the data. Here is a blog post with some code example for this: http://stevenskelton.ca/real-time-data-mining-spark/ Or try sc.wholeTextFiles(). It reads the entire file into a string record. You'll want to make sure that you have enough memory to read the single string into memory. This Cloudera blog post about half-way down has some Regex examples of how to use Scala to parse an XML file into a collection of tuples: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/ You can also search for XMLInputFormat on Google. There are some implementations that allow you to specify the tag to split on, e.g.: https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java Good luck! Sameer F. Client Services @ Databricks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-BZ2-XML-file-in-Spark-tp16954p16960.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
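A minimal hedged sketch of the repartition approach described above; the HDFS path and partition count are placeholders, and reading the .bz2 directly assumes a Hadoop version with splittable bzip2 support as noted.

val osm = sc.textFile("hdfs:///data/planet-latest.osm.bz2")
val spread = osm.repartition(100)  // spread a file that arrived in only a few partitions
println(spread.partitions.length)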
Re: spark ui redirecting to port 8100
Hi Sadhan, Which port are you specifically trying to redirect? The driver program has a web UI, typically on port 4040... and the Spark Standalone Cluster Master has a web UI exposed on port 8080 (7077 is the port applications connect to, not the UI). Which setting did you update in which file to make this change? And finally, which version of Spark are you on? Sameer F. Client Services @ Databricks On Tue, Oct 21, 2014 at 3:29 PM, sadhan sadhan.s...@gmail.com wrote: I set the Spark port to a different one and the connection seems successful, but I get a 302 redirect to /proxy on port 8100. Nothing is listening on that port either. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Streaming - How to write RDD's in same directory ?
Hello, Spark 1.1.0, Hadoop 2.4.1 I have written a Spark streaming application. And I am getting FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath). Here in brief is what I am trying to do. My application is creating a text file stream using the Java streaming context. The input file is on HDFS. JavaDStream<String> textStream = ssc.textFileStream(InputFile); Then it is comparing each line of the input stream with some data and filtering it. The filtered data I am storing in a JavaDStream<String>. JavaDStream<String> suspectedStream = textStream.flatMap(new FlatMapFunction<String, String>(){ @Override public Iterable<String> call(String line) throws Exception { List<String> filteredList = new ArrayList<String>(); // doing filter job return filteredList; } }); And this filteredList I am storing in HDFS as: suspectedStream.foreach(new Function<JavaRDD<String>, Void>(){ @Override public Void call(JavaRDD<String> rdd) throws Exception { rdd.saveAsTextFile(outputFolderPath); return null; }}); But with this I am receiving org.apache.hadoop.mapred.FileAlreadyExistsException. I tried appending a random number to outputFolderPath and it's working. But my requirement is to collect all output in one directory. Can you please suggest if there is any way to get rid of this exception ? Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?
Hi, Can you post what the error looks like? Sameer F. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943p16963.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Hi, this sounds like a bug which has been fixed in the current master. What version of Spark are you using? Would it be possible to update to the current master? If not, it would be helpful to know some more of the problem dimensions (num examples, num features, feature types, label type). Thanks, Joseph On Tue, Oct 21, 2014 at 2:42 AM, lokeshkumar lok...@dataken.net wrote: Hi All, I am trying to run the spark example JavaDecisionTree code using some external data set. It works for certain dataset only with specific maxBins and maxDepth settings. Even for a working dataset if I add a new data item I get a ArrayIndexOutOfBounds Exception, I get the same exception for the first case as well (changing maxBins and maxDepth). I am not sure what is wrong here, can anyone please explain this. Exception stacktrace: 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage 7.0 (TID 13) java.lang.ArrayIndexOutOfBoundsException: 6301 at org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) at org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) at org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0 (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301 org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648) org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706) 
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
Re: Spark Streaming - How to write RDD's in same directory ?
Hi Shailesh, Spark just leverages the Hadoop File Output Format to write out the RDD you are saving. This is really a Hadoop OutputFormat limitation which requires the directory it is writing into to not exist. The idea is that a Hadoop job should not be able to overwrite the results from a previous job, so it enforces that the dir should not exist. Easiest way to get around this may be to just write the results from each Spark app to a newly named directory, then on an interval run a simple script to merge data from multiple HDFS directories into one directory. This HDFS command will let you do something like a directory merge: hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal - /newfolderpath/file See this StackOverflow discussion for a way to do it using Pig and Bash scripting also: https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder Sameer F. Client Services @ Databricks On Tue, Oct 21, 2014 at 3:51 PM, Shailesh Birari sbir...@wynyardgroup.com wrote: Hello, Spark 1.1.0, Hadoop 2.4.1 I have written a Spark streaming application. And I am getting FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath). Here in brief is what I am trying to do. My application is creating a text file stream using the Java streaming context. The input file is on HDFS. JavaDStream<String> textStream = ssc.textFileStream(InputFile); Then it is comparing each line of the input stream with some data and filtering it. The filtered data I am storing in a JavaDStream<String>. JavaDStream<String> suspectedStream = textStream.flatMap(new FlatMapFunction<String, String>(){ @Override public Iterable<String> call(String line) throws Exception { List<String> filteredList = new ArrayList<String>(); // doing filter job return filteredList; } }); And this filteredList I am storing in HDFS as: suspectedStream.foreach(new Function<JavaRDD<String>, Void>(){ @Override public Void call(JavaRDD<String> rdd) throws Exception { rdd.saveAsTextFile(outputFolderPath); return null; }}); But with this I am receiving org.apache.hadoop.mapred.FileAlreadyExistsException. I tried appending a random number to outputFolderPath and it's working. But my requirement is to collect all output in one directory. Can you please suggest if there is any way to get rid of this exception ? Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
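A hedged, self-contained Scala sketch of the "new directory per batch" idea (using the Scala DStream API rather than the Java API used above); paths and batch interval are placeholders. foreachRDD's two-argument form provides the batch time, which gives each interval its own output directory.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.textFileStream("hdfs:///input")  // stands in for the filtered stream above
lines.foreachRDD { (rdd, time) =>
  // one fresh directory per batch avoids FileAlreadyExistsException
  rdd.saveAsTextFile("hdfs:///output/batch-" + time.milliseconds)
}
ssc.start()
ssc.awaitTermination()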
Using the DataStax Cassandra Connector from PySpark
Hi there, I'm using Spark 1.1.0 and experimenting with trying to use the DataStax Cassandra Connector (https://github.com/datastax/spark-cassandra-connector) from within PySpark. As a baby step, I'm simply trying to validate that I have access to classes that I'd need via Py4J. Sample python program: from py4j.java_gateway import java_import from pyspark.conf import SparkConf from pyspark import SparkContext conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1") sc = SparkContext(appName="Spark + Cassandra Example", conf=conf) java_import(sc._gateway.jvm, "com.datastax.spark.connector.*") print sc._jvm.CassandraRow() CassandraRow corresponds to https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/CassandraRow.scala which is included in the JAR I submit. Feel free to download the JAR here https://dl.dropboxusercontent.com/u/4385786/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar I'm currently running this Python example with: spark-submit --driver-class-path=/path/to/pyspark-cassandra-0.1.0-SNAPSHOT-standalone.jar --verbose src/python/cassandara_example.py But continually get the following error indicating that the classes aren't in fact on the classpath of the GatewayServer: Traceback (most recent call last): File "/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py", line 37, in <module> main() File "/Users/mikesukmanowsky/Development/parsely/pyspark-cassandra/src/python/cassandara_example.py", line 25, in main print sc._jvm.CassandraRow() File "/Users/mikesukmanowsky/.opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__ py4j.protocol.Py4JError: Trying to call a package. The correct response from the GatewayServer should be: In [22]: gateway.jvm.CassandraRow() Out[22]: <JavaObject id=o0> Also tried using --jars option instead and that doesn't seem to work either. Is there something I'm missing as to why the classes aren't available? -- Mike Sukmanowsky Aspiring Digital Carpenter *p*: +1 (416) 953-4248 *e*: mike.sukmanow...@gmail.com facebook http://facebook.com/mike.sukmanowsky | twitter http://twitter.com/msukmanowsky | LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github https://github.com/msukmanowsky
Re: Primitive arrays in Spark
It seems that ++ does the right thing on arrays of longs, and gives you another one: scala> val a = Array[Long](1,2,3) a: Array[Long] = Array(1, 2, 3) scala> val b = Array[Long](1,2,3) b: Array[Long] = Array(1, 2, 3) scala> a ++ b res0: Array[Long] = Array(1, 2, 3, 1, 2, 3) scala> res0.getClass res1: Class[_ <: Array[Long]] = class [J The problem might be that lots of intermediate space is allocated as you merge values two by two. In particular, if a key has N arrays mapping to it, your code will allocate O(N^2) space because it builds first an array of size 1, then 2, then 3, etc. You can make this faster by using aggregateByKey instead, and using an intermediate data structure other than an Array to do the merging (ideally you'd find a growable ArrayBuffer-like class specialized for Longs, but you can also just try ArrayBuffer). Matei On Oct 21, 2014, at 1:08 PM, Akshat Aranya aara...@gmail.com wrote: This is as much of a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse like this: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b) If I start with an Array of primitive longs in rdd1, will rdd2 also have Arrays of primitive longs? I suspect, based on my memory usage, that this is not the case. Also, would it be more efficient to do this: val rdd1: RDD[(Long, ArrayBuffer[Long])] and then val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b).map(_.toArray) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
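A hedged sketch of the aggregateByKey suggestion, using ArrayBuffer[Long] as the growable intermediate and converting back to Array[Long] only once per key; the sample data is made up and sc is the shell's SparkContext.

import scala.collection.mutable.ArrayBuffer

val rdd1 = sc.parallelize(Seq((1L, Array(1L, 2L)), (1L, Array(3L)), (2L, Array(4L))))
val rdd2 = rdd1
  .aggregateByKey(ArrayBuffer.empty[Long])(
    (buf, arr) => { buf ++= arr; buf },  // fold one array into the per-key buffer
    (b1, b2) => { b1 ++= b2; b1 })       // merge buffers coming from different partitions
  .mapValues(_.toArray)
rdd2.collect().foreach { case (k, v) => println(k + " -> " + v.mkString(",")) }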
Re: Spark Streaming - How to write RDD's in same directory ?
Thanks Sameer for quick reply. I will try to implement it. Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962p16970.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql not able to find classes with --jars option
It was mainly because Spark was setting the jar classes in a thread-local context classloader. The quick fix was to make our SerDe use the context classloader first. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-not-able-to-find-classes-with-jars-option-tp16839p16972.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Strategies for reading large numbers of files
Thanks to folks here for the suggestions. I ended up settling on what seems to be a simple and scalable approach. I am no longer using sparkContext.textFiles with wildcards (it is too slow when working with a large number of files). Instead, I have implemented directory traversal as a Spark job, which enables it to parallelize across the cluster. First, a couple of functions. One to traverse directories, and another to get the lines in a file: def list_file_names(path: String): Seq[String] = { val fs = FileSystem.get(new java.net.URI(path), hadoopConfiguration) def f(path: Path): Seq[String] = { Option(fs.listStatus(path)).getOrElse(Array[FileStatus]()). flatMap { case fileStatus if fileStatus.isDir ⇒ f(fileStatus.getPath) case fileStatus ⇒ Seq(fileStatus.getPath.toString) } } f(new Path(path)) } def read_log_file(path: String): Seq[String] = { val fs = FileSystem.get(new java.net.URI(path), hadoopConfiguration) val file = fs.open(new Path(path)) val source = Source.fromInputStream(file) source.getLines.toList } Next, I generate a list of root paths to scan: val paths = for { record_type ← record_types year ← years month ← months day ← days hour ← hours } yield { s"s3n://s3-bucket-name/$record_type/$year/$month/$day/$hour/" } (In this case, I generate one path per hour per record type.) Finally, using Spark, I can build an RDD with the contents of every file in the path list: val rdd: RDD[String] = sparkContext. parallelize(paths, paths.size). flatMap(list_file_names). flatMap(read_log_file) I am posting this info here with the hope that it will be useful to somebody in the future. L On Tue, Oct 7, 2014 at 12:58 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi Landon I had a problem very similar to yours, where we have to process around 5 million relatively small files on NFS. After trying various options, we did something similar to what Matei suggested. 1) take the original path and find the subdirectories under that path and then parallelize the resulting list. you can configure the depth you want to go down to before sending the paths across the cluster. def getFileList(srcDir: File, depth: Int): List[File] = { var list: ListBuffer[File] = new ListBuffer[File]() if (srcDir.isDirectory()) { srcDir.listFiles() .foreach((file: File) => if (file.isFile()) { list += (file) } else { if (depth > 0) { list ++= getFileList(file, (depth - 1)) } else if (depth < 0) { list ++= getFileList(file, (depth)) } else { list += file } }) } else { list += srcDir } list.toList } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-tp15644p15835.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- *Landon Kuhn*, *Software Architect*, Janrain, Inc. http://bit.ly/cKKudR E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025 Follow Janrain: Facebook http://bit.ly/9CGHdf | Twitter http://bit.ly/9umxlK | YouTube http://bit.ly/N0OiBT | LinkedIn http://bit.ly/a7WZMC | Blog http://bit.ly/OI2uOR Follow Me: LinkedIn http://www.linkedin.com/in/landonkuhn - *Acquire, understand, and engage your users. Watch our video http://bit.ly/janrain-overview or sign up for a live demo http://bit.ly/janraindemo to see what it's all about.*
Re: spark sql: sqlContext.jsonFile date type detection and performance
Yes, I am unable to get jsonFile() to detect the date type automatically from JSON data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16974.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark Streaming Applications
Hi, I have been trying to find a fairly complex application that makes use of the Spark Streaming framework. I checked public GitHub repos, but the examples I found were too simple, comprising only simple operations like counters and sums. On the Spark Summit website I could find very interesting projects; however, no source code was available. Where can I find non-trivial Spark Streaming application code? Is it that difficult? Thanks.
spark 1.1.0 RDD and Calliope 1.1.0-CTP-U2-H2
Hi, I am using the latest calliope library from tuplejump.com to create an RDD for a Cassandra table. I am on a 3-node Spark 1.1.0 cluster with YARN. My Cassandra table is defined as below and I have about 2000 rows of data inserted. CREATE TABLE top_shows ( program_id varchar, view_minute timestamp, view_count counter, PRIMARY KEY (view_minute, program_id) //note that view_minute is the partition key ); Here are the simple steps I ran from spark-shell on the master node: spark-shell --master yarn-client --jars rna/rna-spark-streaming-assembly-1.0-SNAPSHOT.jar --driver-memory 512m --executor-memory 512m --num-executors 3 --executor-cores 1 // Import the necessary import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import com.tuplejump.calliope.utils.RichByteBuffer._ import com.tuplejump.calliope.Implicits._ import com.tuplejump.calliope.CasBuilder import com.tuplejump.calliope.Types.{CQLRowKeyMap, CQLRowMap} // Define my class and the implicit cast case class ProgramViewCount(viewMinute: Long, program: String, viewCount: Long) implicit def keyValtoProgramViewCount(key: CQLRowKeyMap, values: CQLRowMap): ProgramViewCount = ProgramViewCount(key.get("view_minute").get.getLong, key.get("program_id").toString, values.get("view_count").get.getLong) // Use the cql3 interface to read from the table with a WHERE predicate. val cas = CasBuilder.cql3.withColumnFamily("streaming_qa", "top_shows").onHost("23.22.120.96") .where("view_minute = 141386178") val allPrograms = sc.cql3Cassandra[ProgramViewCount](cas) // Lazy evaluation till this point val rowCount = allPrograms.count I hit the following exception. It seems that it does not like my WHERE clause. If I do not have the WHERE clause, it works fine. But with the WHERE clause, no matter whether the predicate is on the partition key or not, it will fail with the following exception. Can anyone else using the calliope package share some light? Thanks a lot. Tian scala> val rowCount = allPrograms.count 14/10/21 23:26:07 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, ip-10-187-51-136.ec2.internal): java.lang.RuntimeException: com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665) com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301) com.tuplejump.calliope.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167) com.tuplejump.calliope.cql3.Cql3CassandraRDD$$anon$1.<init>(Cql3CassandraRDD.scala:75) com.tuplejump.calliope.cql3.Cql3CassandraRDD.compute(Cql3CassandraRDD.scala:64) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-RDD-and-Calliope-1-1-0-CTP-U2-H2-tp16975.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Cassandra connector issue
Is this because I am calling a transformation function on an RDD from inside another transformation function? Is it not allowed? Thanks Ankur On Oct 21, 2014 1:59 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi Gerard, this is the code that may be helpful. public class ReferenceDataJoin implements Serializable { private static final long serialVersionUID = 1039084794799135747L; JavaPairRDD<String, Employee> rdd; CassandraJavaRDD<ReferenceData> referenceTable; public PostalReferenceDataJoin(List<Employee> employees) { JavaSparkContext sc = SparkContextFactory.getSparkContextFactory().getSparkContext(); this.rdd = sc.parallelizePairs(employees); this.referenceTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class); } public JavaPairRDD<String, Employee> execute() { JavaPairRDD<String, Employee> joinedRdd = rdd .mapValues(new Function<Employee, Employee>() { private static final long serialVersionUID = -226016490083377260L; @Override public Employee call(Employee employee) throws Exception { ReferenceData data = null; if (employee.getDepartment() != null) { data = referenceTable.where("dept=?", employee.getDepartment()).first(); System.out.println(employee.getDepartment() + " " + data); } if (data != null) { //setters on employee } return employee; } }); return joinedRdd; } } Thanks Ankur On Tue, Oct 21, 2014 at 11:11 AM, Gerard Maas gerard.m...@gmail.com wrote: Looks like that code does not correspond to the problem you're facing. I doubt it would even compile. Could you post the actual code? -kr, Gerard On Oct 21, 2014 7:27 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, I am creating a cassandra java rdd and transforming it using the where clause. It works fine when I run it outside the mapValues, but when I put the code in mapValues I get an error while creating the transformation. Below is my sample code: CassandraJavaRDD<ReferenceData> cassandraRefTable = javaFunctions(sc).cassandraTable("reference_data", "dept_reference_data", ReferenceData.class); JavaPairRDD<String, Employee> joinedRdd = rdd.mapValues(new Function<IPLocation, IPLocation>() { public Employee call(Employee employee) throws Exception { ReferenceData data = null; if(employee.getDepartment() != null) { data = referenceTable.where("postal_plus=?", location.getPostalPlus()).first(); System.out.println(data.toCSV()); } if(data != null) { //call setters on employee } return employee; } } I get this error: java.lang.NullPointerException at org.apache.spark.rdd.RDD.<init>(RDD.scala:125) at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:47) at com.datastax.spark.connector.rdd.CassandraRDD.copy(CassandraRDD.scala:70) at com.datastax.spark.connector.rdd.CassandraRDD.where(CassandraRDD.scala:77) at com.datastax.spark.connector.rdd.CassandraJavaRDD.where(CassandraJavaRDD.java:54) Thanks for help!! Regards Ankur
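For what it's worth, Spark generally does not allow using one RDD (here the Cassandra table RDD) inside a transformation of another RDD, which is consistent with the NullPointerException above. A hedged, self-contained Scala sketch of one common workaround: collect the (small) reference data to the driver, broadcast it, and look it up inside mapValues. Plain tuples stand in for the Employee/ReferenceData classes from the mail.

val reference = sc.parallelize(Seq(("eng", "ref-eng"), ("hr", "ref-hr")))  // dept -> reference data
val employees = sc.parallelize(Seq(("e1", "eng"), ("e2", "hr")))           // employee id -> dept
val refBc = sc.broadcast(reference.collectAsMap())
val joined = employees.mapValues(dept => (dept, refBc.value.getOrElse(dept, "missing")))
joined.collect().foreach(println)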
Re: Spark SQL: sqlContext.jsonFile date type detection and performance
Add one more thing about question 1. Once you get the SchemaRDD from jsonFile/jsonRDD, you can use CAST(columnName as DATE) in your query to cast the column type from StringType to DateType (the string format should be yyyy-[m]m-[d]d and you need to use hiveContext). Here is the code snippet that may help. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val schemaRDD = hiveContext.jsonFile(...) schemaRDD.registerTempTable("jsonTable") hiveContext.sql("SELECT CAST(columnName as DATE) FROM jsonTable") Thanks, Yin On Tue, Oct 21, 2014 at 8:00 PM, Yin Huai huaiyin@gmail.com wrote: Hello Tridib, I just saw this one. 1. Right now, jsonFile and jsonRDD do not detect date type. Right now, IntegerType, LongType, DoubleType, DecimalType, StringType, BooleanType, StructType and ArrayType will be automatically detected. 2. The process of inferring the schema will pass over the entire dataset once to determine the schema. So, you will see a job is launched. Applying a specific schema to a dataset does not have this cost. 3. It is hard to comment on it without seeing your implementation. For our built-in JSON support, jsonFile and jsonRDD provide a very convenient way to work with JSON datasets with SQL. You do not need to define the schema in advance and Spark SQL will automatically create the SchemaRDD for your dataset. You can start to query it with SQL by simply registering the returned SchemaRDD as a temp table. Regarding the implementation, we use a high performance JSON lib (Jackson, https://github.com/FasterXML/jackson) to parse JSON records. Thanks, Yin On Mon, Oct 20, 2014 at 10:56 PM, tridib tridib.sama...@live.com wrote: Hi Spark SQL team, I am trying to explore automatic schema detection for JSON documents. I have a few questions: 1. What should be the date format to detect the fields as date type? 2. Is automatic schema inference slower than applying a specific schema? 3. At this moment I am parsing the JSON myself using a map function and creating a schema RDD from the parsed JavaRDD. Is there any performance impact in not using the inbuilt jsonFile()? Thanks Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark - HiveContext - Unstructured Json
You can resort to SQLContext.jsonFile(path: String, samplingRate: Double) and set samplingRate to 1.0, so that all the columns can be inferred. You can also use SQLContext.applySchema to specify your own schema (which is a StructType). On 10/22/14 5:56 AM, Harivardan Jayaraman wrote: Hi, I have unstructured JSON as my input which may have extra columns from row to row. I want to store these JSON rows using HiveContext so that they can be accessed from the JDBC Thrift Server. I notice there are primarily only two methods available on the SchemaRDD for data - saveAsTable and insertInto. One defines the schema while the other can be used to insert into the table, but there is no way to alter the table and add columns to it. How do I do this? One option that I thought of is to write native CREATE TABLE... and ALTER TABLE.. statements, but that just does not seem feasible because at every step I will need to query Hive to determine what the current schema is and make a decision whether I should add columns to it or not. Any thoughts? Has anyone been able to do this?
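A small hedged sketch of the samplingRate variant mentioned above; the path is a placeholder and sqlContext is a plain SQLContext built from sc. With samplingRate = 1.0 every record is scanned while inferring the schema, so columns that only appear in some rows are still picked up.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val events = sqlContext.jsonFile("hdfs:///data/events.json", 1.0)  // scan all records for schema inference
events.printSchema()
events.registerTempTable("events")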
Re: Asynchronous Broadcast from driver to workers, is it possible?
Looks like the only way is to implement that feature. There is no way of hacking it into working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: com.esotericsoftware.kryo.KryoException: Buffer overflow.
You ran out of Kryo buffer. Are you using Spark 1.1 (which supports buffer resizing) or Spark 1.0 (which has a fixed-size buffer)? On Oct 21, 2014 5:30 PM, nitinkak001 nitinkak...@gmail.com wrote: I am running a simple rdd filter command. What does it mean? Here is the full stack trace (and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420) at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) *Here is the code of the main function:* String comparisonFieldIndexes = "16,18"; String segmentFieldIndexes = "14,15"; String comparisonFieldWeights = "50, 50"; String delimiter = "" + '\001'; PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70, comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes, delimiter); JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER)); parOnCol.printRDD(filtered_rdd); *Here is the FilterEmptyFields class:* public class FilterEmptyFields implements Function<String, Boolean> { final int[] nonEmptyFields; final String DELIMITER; public FilterEmptyFields(int[] nonEmptyFields, String delimiter){ this.nonEmptyFields = nonEmptyFields; this.DELIMITER = delimiter; } @Override public Boolean call(String s){ String[] fields = s.split(DELIMITER); for(int i = 0; i < nonEmptyFields.length; i++){ if(fields[nonEmptyFields[i]] == null || fields[nonEmptyFields[i]].isEmpty()){ return false; } } return true; } } Any suggestions guys? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
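A hedged sketch of raising the Kryo buffer rather than relying on the defaults. As far as I recall, on Spark 1.1 the relevant settings are spark.kryoserializer.buffer.mb (initial per-task buffer) and spark.kryoserializer.buffer.max.mb (resize ceiling), while Spark 1.0 only has the fixed-size setting; the values below are arbitrary examples.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-buffer-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "8")        // initial buffer size in MB
  .set("spark.kryoserializer.buffer.max.mb", "128")  // upper bound for resizing (Spark 1.1)
val sc = new SparkContext(conf)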
Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Hi Joseph, I am using Spark 1.1.0, the latest version; I will try to update to the current master and check. The example I am running is JavaDecisionTree; the dataset is in libsvm format, containing: 1. 45 training instances 2. 5 features 3. I am not sure what the feature type is, but there are no categorical features being passed in the example. 4. Three labels; not sure what the label type is. The example runs fine with 100 as the maxBins value, but when I change this to say 50 or 30 I get the exception. Also, could you please let me know what should be the default value for maxBins (the API says 100 as default, but it did not work in this case)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907p16988.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
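A hedged Scala sketch of calling the tree trainer directly with explicit maxDepth/maxBins on a dataset shaped like the one described above (3 classes, 5 continuous features, libsvm input); the path is a placeholder and this is not the JavaDecisionTree example itself.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/training.libsvm")
val model = DecisionTree.trainClassifier(data, numClasses = 3,
  categoricalFeaturesInfo = Map[Int, Int](),  // empty map: all features treated as continuous
  impurity = "gini", maxDepth = 5, maxBins = 100)
println(model)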
Num-executors and executor-cores overwritten by defaults
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can no longer set the number of executors or executor-cores. No matter what values I pass on the command line to Spark, they are overwritten by the defaults. Does anyone have any idea what could have happened here? Running on Spark 1.0.2 before, I had no trouble. Also, I am able to launch the spark shell without these parameters being overwritten.
spark sql query optimization and decision tree building
Hi all, I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns. Here each line is a feature and each column is a dimension. I have converted the txt files into JSON format and am able to run SQL queries on the JSON files using Spark. Now I am trying to build a k-dimensional decision tree (kd tree) with this large data. My steps: 1) calculate the variance of each column, pick the column with maximum variance and make it the key of the first node, and the mean of the column the value of the node. 2) based on the first node value, split the data into 2 parts and repeat the process until you reach a stopping point. My sample code: import sqlContext._ val people = sqlContext.jsonFile("siftoutput/") people.printSchema() people.registerTempTable("people") val output = sqlContext.sql("SELECT * From people") My Questions: 1) How to save the result values of a query into a list? 2) How to calculate the variance of a column? Is there any efficient way? 3) I will be running multiple queries on the same data. Does Spark have any way to optimize this? 4) How to save the output as key value pairs in a text file? 5) Is there any way I can build a decision kd tree using the machine learning libraries of Spark? please help Thanks, Sanath
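A hedged sketch for question 2, assuming each of the 128 columns can be read from the query result rows as doubles (row.getDouble(i) is an assumption about the inferred column types); MLlib's Statistics.colStats computes per-column mean and variance in a single pass over the data.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = output.map(row => Vectors.dense((0 until 128).map(row.getDouble).toArray))
val summary = Statistics.colStats(vectors)
val variances = summary.variance.toArray              // one variance per column
val maxVarCol = variances.zipWithIndex.maxBy(_._1)._2 // index of the column with the largest variance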
Re: create a Row Matrix
Thanks for the quick response. However, I still only get error messages. I am able to load a .txt file with entries in it and use it in Spark, but I am not able to create a simple matrix, for instance a 2x2 row matrix [1 2 3 4]. I tried variations such as val RowMatrix = Matrix(2,2,array(1,3,2,4)) but it doesn't work. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/create-a-Row-Matrix-tp16913p16993.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
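A hedged sketch of building that small example and running SVD on it, reading the entries row by row so the 2x2 matrix becomes the rows (1, 2) and (3, 4); a RowMatrix is constructed from an RDD of MLlib vectors, one vector per row, and sc is the shell's SparkContext.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)  // keep k = 2 singular values
println(svd.s)  // singular values
println(svd.V)  // right singular vectors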