can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Hi,

I have a DStream that works just fine when I say:

dstream.print

If I say:

dstream.map(_,1).print

that works, too.  However, if I do the following:

dstream.reduce{case(x,y) => x}.print

I don't get anything on my console.  What's going on?

Thanks


Re: Large Task Size?

2014-07-13 Thread Kyle Ellrott
It uses the standard SquaredL2Updater, and I also tried to broadcast it as
well.

The input is an RDD created by taking the union of several inputs that have
all been run against MLUtils.kFold to produce even more RDDs. In this case I
run with 10 different inputs, each with 10 kFolds. I'm pretty certain that all
of the input RDDs have clean closures. But I'm curious, is there a high
overhead for running union? Could that create larger task sizes?
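
For reference, a minimal sketch (with illustrative names, not the actual
GroupedGradientDescent code) of the usual pattern for keeping task closures
small by broadcasting large read-only state:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

def run(sc: SparkContext, bigModel: Array[Double]): Double = {
  // broadcast the large object once; each task captures only the small handle
  val modelBc = sc.broadcast(bigModel)
  sc.parallelize(1 to 1000)
    .map(i => i.toDouble * modelBc.value(i % modelBc.value.length))
    .sum()
}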

Kyle



On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson  wrote:

> I also did a quick glance through the code and couldn't find anything
> worrying that should be included in the task closures. The only possibly
> unsanitary part is the Updater you pass in -- what is your Updater and is
> it possible it's dragging in a significant amount of extra state?
>
>
> On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott 
> wrote:
>
>> I'm working of a patch to MLLib that allows for multiplexing several
>> different model optimization using the same RDD ( SPARK-2372:
>> https://issues.apache.org/jira/browse/SPARK-2372 )
>>
>> In testing larger datasets, I've started to see some memory errors (
>> java.lang.OutOfMemoryError and "exceeds max allowed: spark.akka.frameSize"
>> errors ).
>> My main clue is that Spark will start logging warnings on smaller systems
>> like:
>>
>> 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
>> task of very large size (10119 KB). The maximum recommended task size is
>> 100 KB.
>>
>> Looking up stage '2862' leads to a 'sample at
>> GroupedGradientDescent.scala:156' call. That code can be seen at
>>
>> https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156
>>
>> I've looked over the code, I'm broadcasting the larger variables, and
>> between the sampler and the combineByKey, I wouldn't think there much data
>> being moved over the network, much less a 10MB chunk.
>>
>> Any ideas of what this might be a symptom of?
>>
>> Kyle
>>
>>
>


Re: Nested Query With Spark SQL(1.0.1)

2014-07-13 Thread anyweil
Or is it supported? I know I could do it myself with filter, but if SQL
could support it, that would be much better, thanks!





RE: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-13 Thread Peretz, Gil
Thank you, I managed to get Spark-Shell working by re-compiling and
packaging Spark:

SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean 
assembly


Regards.
---
Gil Peretz , +972-54-5597107

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, 09 July 2014 21:11
To: user@spark.apache.org
Subject: Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize 
class $line10.$read$

At first glance that looks like an error with the class shipping in the spark 
shell (i.e. the lines that you type into the spark shell are compiled into 
classes and then shipped to the executors where they run).  Are you able to run 
other spark examples with closures in the same shell?
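
For instance, a quick closure sanity check that can be pasted into the same
shell (an illustrative snippet, not from the original report):

scala> val factor = 5
scala> sc.parallelize(1 to 10).map(_ * factor).collect().foreach(println)

If that also fails, the problem is likely with shipping REPL-defined classes
in general rather than with the SQL code specifically.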

Michael

On Wed, Jul 9, 2014 at 4:28 AM, gil...@gmail.com wrote:
Hello, While trying to run the example below I am getting errors. I have built
Spark using the following command:

$ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

--- Running the example using Spark-shell ---

$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar \
  HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

--- error ---

java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
  at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
  at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
  at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
  at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
  at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
  at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
  at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)




Error in JavaKafkaWordCount.java example

2014-07-13 Thread Mahebub Sayyed
Hello,

I am referring following example:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

I am getting the following compilation error:
\example\JavaKafkaWordCount.java:[62,70] error: cannot access ClassTag

Please help me.
Thanks in advance.

-- 
*Regards,*
*Mahebub Sayyed*


Problem reading in LZO compressed files

2014-07-13 Thread Ognen Duzlevski

Hello,

I have been trying to play with the Google ngram dataset provided by 
Amazon in form of LZO compressed files.


I am having trouble understanding what is going on ;). I have added the 
compression jar and native library to the underlying Hadoop/HDFS 
installation, restarted the name node and the datanodes, Spark can 
obviously see the file but I get gibberish on a read. Any ideas?


See output below:

14/07/13 14:39:19 INFO SparkContext: Added JAR 
file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at 
http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with 
timestamp 1405262359777

14/07/13 14:39:20 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with 
curMem=0, maxMem=311387750
14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 160.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at 
:12


scala> f.take(10)
14/07/13 14:39:43 INFO SparkContext: Job finished: take at :15, 
took 0.419708348 s
res0: Array[String] = 
Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec���\?�?�?�m??��??hx??�??�???�??�??�??�??�??�? 
�?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �? 
�?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?�?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...


Thanks!
Ognen


Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
If you’re still seeing gibberish, it’s because Spark is not using the LZO
libraries properly. In your case, I believe you should be calling
newAPIHadoopFile() instead of textFile().

For example:

sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text])

On a side note, here’s a related JIRA issue: SPARK-2394: Make it easier to
read LZO-compressed files from EC2 clusters


Nick


On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski  wrote:

> Hello,
>
> I have been trying to play with the Google ngram dataset provided by
> Amazon in form of LZO compressed files.
>
> I am having trouble understanding what is going on ;). I have added the
> compression jar and native library to the underlying Hadoop/HDFS
> installation, restarted the name node and the datanodes, Spark can
> obviously see the file but I get gibberish on a read. Any ideas?
>
> See output below:
>
> 14/07/13 14:39:19 INFO SparkContext: Added JAR file:/home/ec2-user/hadoop/
> lib/hadoop-gpl-compression-0.1.0.jar at http://10.10.0.100:40100/jars/
> hadoop-gpl-compression-0.1.0.jar with timestamp 1405262359777
> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
> Spark context available as sc.
>
> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with
> curMem=0, maxMem=311387750
> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 160.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
> :12
>
> scala> f.take(10)
> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at :15,
> took 0.419708348 s
> res0: Array[String] = Array(SEQ?!org.apache.hadoop.
> io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.
> compression.lzo.LzoCodec���\ 
> 
> ?�?�?�m??��??hx??�??�???�??�??�??�??�??�?
> �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
> �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?
> 4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?
> H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?
> \�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?
> p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
> �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>
> Thanks!
> Ognen
>


Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Guanhua Yan
Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list
concatenation operations, and found that the performance becomes even worse.
So groupByKey is not that bad in my code.

Best regards,
- Guanhua



From:  Aaron Davidson 
Reply-To:  
Date:  Sat, 12 Jul 2014 16:32:22 -0700
To:  
Subject:  Re: Confused by groupByKey() and the default partitioner

Yes, groupByKey() does partition by the hash of the key unless you specify a
custom Partitioner.

(1) If you were to use groupByKey() when the data was already partitioned
correctly, the data would indeed not be shuffled. Here is the associated
code, you'll see that it simply checks that the Partitioner the groupBy() is
looking for is equal to the Partitioner of the pre-existing RDD:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L89

By the way, I should warn you that groupByKey() is not a recommended
operation if you can avoid it, as it has non-obvious performance issues when
running with serious data.
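
As a concrete illustration of that last point (a sketch with made-up data, not
code from this thread):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 10), ("c", 7)))

// groupByKey shuffles every value and materializes the full list per key
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines partial sums map-side, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)

Both produce the same per-key sums; only the second avoids building the
intermediate collections.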


On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan  wrote:
> Hi:
> 
> I have trouble understanding the default partitioner (hash) in Spark. Suppose
> that an RDD with two partitions is created as follows:
> x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
> Does spark partition x based on the hash of the key (e.g., "a", "b", "c") by
> default?
> (1) Assuming this is correct, if I further use the groupByKey primitive,
> x.groupByKey(), all the records sharing the same key should be located in the
> same partition. Then it's not necessary to shuffle the data records around, as
> all the grouping operations can be done locally.
> (2) If it's not true, how could I specify a partitioner simply based on the
> hashing of the key (in Python)?
> Thank you,
> - Guanhua





Re: Problem reading in LZO compressed files

2014-07-13 Thread Ognen Duzlevski

Nicholas,

Thanks!

How do I make spark assemble against a local version of Hadoop?

I have 2.4.1 running on a test cluster and I did 
"SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly" but all it did was pull in 
hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1 
HDFS). I am guessing my local version of Hadoop libraries/jars is not 
used. Alternatively, how do I add the hadoop-gpl-compression-0.1.0.jar 
(responsible for the lzo stuff) to this hand assembled Spark?


I am running the spark-shell like this:
bin/spark-shell --jars 
/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar


and getting this:

scala> val f = 
sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo",classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with 
curMem=0, maxMem=311387750
14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 211.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, 
org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at 
:12


scala> f.take(1)
14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.JobContext, but class was expected
at 
com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)


which makes me think something is not linked to something properly (not 
a Java expert unfortunately).


Thanks!
Ognen


On 7/13/14, 10:35 AM, Nicholas Chammas wrote:


If you’re still seeing gibberish, it’s because Spark is not using the 
LZO libraries properly. In your case, I believe you should be calling 
newAPIHadoopFile() instead of textFile().

For example:

sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
   classOf[org.apache.hadoop.io.LongWritable],
   classOf[org.apache.hadoop.io.Text])

On a side note, here’s a related JIRA issue: SPARK-2394: Make it 
easier to read LZO-compressed files from EC2 clusters 



Nick



On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <ognen.duzlev...@gmail.com> wrote:


Hello,

I have been trying to play with the Google ngram dataset provided
by Amazon in form of LZO compressed files.

I am having trouble understanding what is going on ;). I have
added the compression jar and native library to the underlying
Hadoop/HDFS installation, restarted the name node and the
datanodes, Spark can obviously see the file but I get gibberish on
a read. Any ideas?

See output below:

14/07/13 14:39:19 INFO SparkContext: Added JAR
file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at
http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar
with timestamp 1405262359777
14/07/13 14:39:20 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo
")
14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called
with curMem=0, maxMem=311387750
14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 160.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
:12

scala> f.take(10)
14/07/13 14:39:43 INFO SparkContext: Job finished: take at
:15, took 0.419708348 s
res0: Array[String] =

Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec���\�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
�?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...

Thanks!
Ognen






Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Aaron Davidson
Ah -- I should have been more clear: list concatenation isn't going to be
any faster. In many cases I've seen people use groupByKey() when they are
really trying to do some sort of aggregation, and thus constructing the
concatenated list is more expensive than they need.


On Sun, Jul 13, 2014 at 9:13 AM, Guanhua Yan  wrote:

> Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list
> concatenation operations, and found that the performance becomes even
> worse. So groupByKey is not that bad in my code.
>
> Best regards,
> - Guanhua
>
>
>
> From: Aaron Davidson 
> Reply-To: 
> Date: Sat, 12 Jul 2014 16:32:22 -0700
> To: 
> Subject: Re: Confused by groupByKey() and the default partitioner
>
> Yes, groupByKey() does partition by the hash of the key unless you specify
> a custom Partitioner.
>
> (1) If you were to use groupByKey() when the data was already partitioned
> correctly, the data would indeed not be shuffled. Here is the associated
> code, you'll see that it simply checks that the Partitioner the groupBy()
> is looking for is equal to the Partitioner of the pre-existing RDD:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L89
>
> By the way, I should warn you that groupByKey() is not a recommended
> operation if you can avoid it, as it has non-obvious performance issues
> when running with serious data.
>
>
> On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan  wrote:
>
>> Hi:
>>
>> I have trouble understanding the default partitioner (hash) in Spark.
>> Suppose that an RDD with two partitions is created as follows:
>>
>> x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
>>
>> Does spark partition x based on the hash of the key (e.g., "a", "b", "c") by 
>> default?
>>
>> (1) Assuming this is correct, if I further use the groupByKey primitive, 
>> x.groupByKey(), all the records sharing the same key should be located in 
>> the same partition. Then it's not necessary to shuffle the data records 
>> around, as all the grouping operations can be done locally.
>>
>> (2) If it's not true, how could I specify a partitioner simply based on the 
>> hashing of the key (in Python)?
>>
>> Thank you,
>>
>> - Guanhua
>>
>>
>


Re: can't print DStream after reduce

2014-07-13 Thread Tathagata Das
Try doing DStream.foreachRDD and then printing the RDD count and further
inspecting the RDD.
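For example, a sketch of the suggested inspection, applied to the dstream from
the original message:

dstream.foreachRDD { rdd =>
  println("batch count = " + rdd.count())
  rdd.take(5).foreach(println)
}
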
On Jul 13, 2014 1:03 AM, "Walrus theCat"  wrote:

> Hi,
>
> I have a DStream that works just fine when I say:
>
> dstream.print
>
> If I say:
>
> dstream.map(_,1).print
>
> that works, too.  However, if I do the following:
>
> dstream.reduce{case(x,y) => x}.print
>
> I don't get anything on my console.  What's going on?
>
> Thanks
>


Re: spark ui on yarn

2014-07-13 Thread Koert Kuipers
my yarn environment does have less memory for the executors.

i am checking if the RDDs are cached by calling sc.getRDDStorageInfo, which
shows an RDD as fully cached in memory, yet it does not show up in the UI
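
For reference, this is the kind of check being described (a sketch; field names
as in the Spark 1.0 RDDInfo, adjust for other versions):

sc.getRDDStorageInfo.foreach { info =>
  println(info.name + ": " + info.numCachedPartitions + "/" + info.numPartitions +
    " partitions cached, " + info.memSize + " bytes in memory")
}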


On Sun, Jul 13, 2014 at 1:49 AM, Matei Zaharia 
wrote:

> The UI code is the same in both, but one possibility is that your
> executors were given less memory on YARN. Can you check that? Or otherwise,
> how do you know that some RDDs were cached?
>
> Matei
>
> On Jul 12, 2014, at 4:12 PM, Koert Kuipers  wrote:
>
> hey shuo,
> so far all stage links work fine for me.
>
> i did some more testing, and it seems kind of random what shows up on the
> gui and what does not. some partially cached RDDs make it to the GUI, while
> some fully cached ones do not. I have not been able to detect a pattern.
>
> is the codebase for the gui different in standalone than in yarn-client
> mode?
>
>
> On Sat, Jul 12, 2014 at 3:34 AM, Shuo Xiang 
> wrote:
>
>> Hi Koert,
>>   Just curious did you find any information like "CANNOT FIND ADDRESS"
>> after clicking into some stage? I've seen similar problems due to lost of
>> executors.
>>
>> Best,
>>
>>
>>
>> On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers  wrote:
>>
>>> I just tested a long lived application (that we normally run in
>>> standalone mode) on yarn in client mode.
>>>
>>> it looks to me like cached rdds are missing in the storage tap of the ui.
>>>
>>> accessing the rdd storage information via the spark context shows rdds
>>> as fully cached but they are missing on storage page.
>>>
>>> spark 1.0.0
>>>
>>
>>
>
>


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Update on this:

val lines = ssc.socketTextStream("localhost", )

lines.print // works

lines.map(_->1).print // works

lines.map(_->1).reduceByKey(_+_).print // nothing printed to driver console

Just lots of:

14/07/13 11:37:40 INFO receiver.BlockGenerator: Pushed block
input-0-1405276660400
14/07/13 11:37:41 INFO scheduler.ReceiverTracker: Stream 0 received 1 blocks
14/07/13 11:37:41 INFO scheduler.JobScheduler: Added jobs for time
1405276661000 ms
14/07/13 11:37:41 INFO storage.MemoryStore: ensureFreeSpace(60) called with
curMem=1275, maxMem=98539929
14/07/13 11:37:41 INFO storage.MemoryStore: Block input-0-1405276661400
stored as bytes to memory (size 60.0 B, free 94.0 MB)
14/07/13 11:37:41 INFO storage.BlockManagerInfo: Added
input-0-1405276661400 in memory on 25.17.218.118:55820 (size: 60.0 B, free:
94.0 MB)
14/07/13 11:37:41 INFO storage.BlockManagerMaster: Updated info of block
input-0-1405276661400


Any insight?

Thanks


On Sun, Jul 13, 2014 at 1:03 AM, Walrus theCat 
wrote:

> Hi,
>
> I have a DStream that works just fine when I say:
>
> dstream.print
>
> If I say:
>
> dstream.map(_,1).print
>
> that works, too.  However, if I do the following:
>
> dstream.reduce{case(x,y) => x}.print
>
> I don't get anything on my console.  What's going on?
>
> Thanks
>


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Thanks for your interest.

lines.foreachRDD(x => println(x.count))

And I got 0 every once in a while (which I think is strange, because
lines.print prints the input I'm giving it over the socket.)


When I tried:

lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))

I got no count.

Thanks


On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das  wrote:

> Try doing DStream.foreachRDD and then printing the RDD count and further
> inspecting the RDD.
> On Jul 13, 2014 1:03 AM, "Walrus theCat"  wrote:
>
>> Hi,
>>
>> I have a DStream that works just fine when I say:
>>
>> dstream.print
>>
>> If I say:
>>
>> dstream.map(_,1).print
>>
>> that works, too.  However, if I do the following:
>>
>> dstream.reduce{case(x,y) => x}.print
>>
>> I don't get anything on my console.  What's going on?
>>
>> Thanks
>>
>


SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi 

I am running into trouble with a nested query using python. To try and debug
it, I first wrote the query I want using sqlite3

select freq.docid, freqTranspose.docid, sum(freq.count *
freqTranspose.count) from
   Frequency as freq,
   (select term, docid, count from Frequency) as freqTranspose
   where freq.term = freqTranspose.term
   group by freq.docid, freqTranspose.docid
   ;


Spark SQL has trouble parsing the "(select ) as freqTranspose" line

Here is what my input data looks like
$ head -n 3 reuters.db.csv
docid,term,count
1_txt_earn,net,1
1_txt_earn,rogers,4

The output from sqlite3 is
$ head -n 6 3hSimilarityMatrix.slow.sql.out
freq.docid  freqTranspose.docid  sum(freq.count * freqTranspose.count)
--  ---  -
1_txt_earn  1_txt_earn   127
1_txt_earn  10054_txt_earn   33
1_txt_earn  10080_txt_crude  146
1_txt_earn  10088_txt_acq11
$ 


My code example pretty much follows
http://spark.apache.org/docs/latest/sql-programming-guide.html

dataFile = sc.textFile("reuters.db.csv")
lines = dataFile.map(lambda l: l.split(","))
def mapLines(line):
    ret = {}
    ret['docid'] = line[0]
    ret['term'] = line[1]
    ret['count'] = line[2]
    return ret
frequency = lines.map(mapLines)
schemaFrequency = sqlContext.inferSchema(frequency)
schemaFrequency.registerAsTable("frequency")

Okay here is where I run into trouble

sqlCmd = "select \
freq.docid, \
freqTranspose.docid \
  from \
  frequency as freq, \
  (select term, docid, count from frequency)  \
"
similarities = sqlContext.sql(sqlCmd)


/Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib
/py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer,
gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o40.sql.
: java.lang.RuntimeException: [1.153] failure: ``('' expected but `from'
found

select freq.docid, freqTranspose.docid
from   frequency as freq,   (select term, docid,
count from frequency)
   

Simple SQL seems to "parse", i.e. select freq.docid from frequency as freq

Any suggestions would be greatly appreciated.

Andy

P.s. I should note, I think I am using version 1.0 ?






Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Michael Armbrust
Hi Andy,

The SQL parser is pretty basic (we plan to improve this for the 1.2
release).  In this case I think part of the problem is that one of your
variables is "count", which is a reserved word.  Unfortunately, we don't
have the ability to escape identifiers at this point.

However, I did manage to get your query to parse using the HiveQL parser,
provided by HiveContext.

hiveCtx.hql("""
select freq.docid, freqTranspose.docid, sum(freq.count *
freqTranspose.count) from
   Frequency freq JOIN
   (select term, docid, count from Frequency) freqTranspose
   where freq.term = freqTranspose.term
   group by freq.docid, freqTranspose.docid""")
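
For completeness, a sketch of constructing the hiveCtx used above, assuming
Spark was built with Hive support (e.g. SPARK_HIVE=true):

import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)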

Michael


On Sun, Jul 13, 2014 at 12:43 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi
>
> I am running into trouble with a nested query using python. To try and
> debug it, I first wrote the query I want using sqlite3
>
> select freq.docid, freqTranspose.docid, sum(freq.count *
> freqTranspose.count) from
>Frequency as freq,
>(select term, docid, count from Frequency) as freqTranspose
>where freq.term = freqTranspose.term
>group by freq.docid, freqTranspose.docid
>;
>
>
> Sparksql has trouble parsing the "(select ) as freqTranspose “ line
>
> Here is what my input data looks like
> $ head -n 3 reuters.db.csv
> docid,term,count
> 1_txt_earn,net,1
> 1_txt_earn,rogers,4
>
> The output from sqlite3 is
> $ head -n 6 3hSimilarityMatrix.slow.sql.out
> freq.docid  freqTranspose.docid  sum(freq.count * freqTranspose.count)
> --  ---  -
> 1_txt_earn  1_txt_earn   127
> 1_txt_earn  10054_txt_earn   33
> 1_txt_earn  10080_txt_crude  146
> 1_txt_earn  10088_txt_acq11
> $
>
>
> My code example pretty much follows
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> dataFile = sc.textFile("reuters.db.csv”)
> lines = dataFile.map(lambda l: l.split(",”))
> def mapLines(line) :
> ret = {}
> ret['docid'] = line[0]
> ret['term'] = line[1]
> ret['count'] = line[2]
> return ret
> frequency = lines.map(mapLines)
> schemaFrequency = sqlContext.inferSchema(frequency)
> schemaFrequency.registerAsTable("frequency”)
>
> Okay here is where I run into trouble
>
> sqlCmd = "select \
> freq.docid, \
> freqTranspose.docid \
>   from \
>   frequency as freq, \
>   (select term, docid, count from frequency)  \
> "
> similarities = sqlContext.sql(sqlCmd)
>
>
> /Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)298  
>raise Py4JJavaError(299 'An error occurred 
> while calling {0}{1}{2}.\n'.--> 300 format(target_id, 
> '.', name), value)301 else:302 raise 
> Py4JError(
> Py4JJavaError: An error occurred while calling o40.sql.
> : java.lang.RuntimeException: [1.153] failure: ``('' expected but `from' found
>
> select freq.docid, freqTranspose.docid   from 
>   frequency as freq,   (select term, docid, count 
> from frequency)
>
>
>
> Simple sql seems to “parse” I.e. Select  freq.docid from frequency as freq
>
>
> Any suggestions would be greatly appreciated.
>
>
> Andy
>
>
> P.s. I should note, I think I am using version 1.0 ?
>
>
>
>


Task serialized size dependent on size of RDD?

2014-07-13 Thread Sébastien Rainville
Hi,

I'm having trouble serializing tasks for this code:

val rddC = (rddA join rddB)
  .map { case (x, (y, z)) => z -> y }
  .reduceByKey( { (y1, y2) => Semigroup.plus(y1, y2) }, 1000)

Somehow when running on a small data set the size of the serialized task is
about 650KB, which is very big, and when running on a big data set it's
about 1.3MB, which is huge. I can't find what's causing it. The only
reference to an object outside the scope of the closure is to a static
method on Semigroup, which should serialize fine. The proof that it's okay
to call Semigroup like that is that I have another operation in the same
job that uses it and that's serializes okay (~30KB).

The part I really don't get is why is the size of the serialized task
dependent on the size of the RDD? I haven't used "parallelize" to create
rddA and rddB, but rather derived them from transformations of hive tables.

Thanks,

- Sebastien


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
More strange behavior:

lines.foreachRDD(x => println(x.first)) // works
lines.foreachRDD(x => println((x.count,x.first))) // no output is printed
to driver console




On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
wrote:

>
> Thanks for your interest.
>
> lines.foreachRDD(x => println(x.count))
>
> And I got 0 every once in a while (which I think is strange, because
> lines.print prints the input I'm giving it over the socket.)
>
>
> When I tried:
>
> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>
> I got no count.
>
> Thanks
>
>
> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Try doing DStream.foreachRDD and then printing the RDD count and further
>> inspecting the RDD.
>>  On Jul 13, 2014 1:03 AM, "Walrus theCat"  wrote:
>>
>>> Hi,
>>>
>>> I have a DStream that works just fine when I say:
>>>
>>> dstream.print
>>>
>>> If I say:
>>>
>>> dstream.map(_,1).print
>>>
>>> that works, too.  However, if I do the following:
>>>
>>> dstream.reduce{case(x,y) => x}.print
>>>
>>> I don't get anything on my console.  What's going on?
>>>
>>> Thanks
>>>
>>
>


Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi Michael

Changing my column name to something other than 'count' fixed the parse error.

Many thanks, 

Andy

From:  Michael Armbrust 
Reply-To:  
Date:  Sunday, July 13, 2014 at 1:18 PM
To:  
Cc:  "u...@spark.incubator.apache.org" 
Subject:  Re: SparkSql newbie problems with nested selects

> Hi Andy,
> 
> The SQL parser is pretty basic (we plan to improve this for the 1.2 release).
> In this case I think part of the problem is that one of your variables is
> "count", which is a reserved word.  Unfortunately, we don't have the ability
> to escape identifiers at this point.
> 
> However, I did manage to get your query to parse using the HiveQL parser,
> provided by HiveContext.
> 
> hiveCtx.hql("""
> select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count)
> from 
>Frequency freq JOIN
>(select term, docid, count from Frequency) freqTranspose
>where freq.term = freqTranspose.term
>group by freq.docid, freqTranspose.docid""")
> 
> Michael
> 
> 
> On Sun, Jul 13, 2014 at 12:43 PM, Andy Davidson
>  wrote:
>> Hi 
>> 
>> I am running into trouble with a nested query using python. To try and debug
>> it, I first wrote the query I want using sqlite3
>> 
>> select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count)
>> from 
>>Frequency as freq,
>>(select term, docid, count from Frequency) as freqTranspose
>>where freq.term = freqTranspose.term
>>group by freq.docid, freqTranspose.docid
>>;
>> 
>> 
>> Sparksql has trouble parsing the "(select ) as freqTranspose ³ line
>> 
>> Here is what my input data looks like
>> $ head -n 3 reuters.db.csv
>> docid,term,count
>> 1_txt_earn,net,1
>> 1_txt_earn,rogers,4
>> 
>> The output from sqlite3 is
>> $ head -n 6 3hSimilarityMatrix.slow.sql.out
>> freq.docid  freqTranspose.docid  sum(freq.count * freqTranspose.count)
>> --  ---  -
>> 1_txt_earn  1_txt_earn   127
>> 1_txt_earn  10054_txt_earn   33
>> 1_txt_earn  10080_txt_crude  146
>> 1_txt_earn  10088_txt_acq11
>> $ 
>> 
>> 
>> My code example pretty much follows
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>> 
>> dataFile = sc.textFile("reuters.db.csv²)
>> lines = dataFile.map(lambda l: l.split(",²))
>> def mapLines(line) :
>> ret = {}
>> ret['docid'] = line[0]
>> ret['term'] = line[1]
>> ret['count'] = line[2]
>> return ret
>> frequency = lines.map(mapLines)
>> schemaFrequency = sqlContext.inferSchema(frequency)
>> schemaFrequency.registerAsTable("frequency²)
>> 
>> Okay here is where I run into trouble
>> 
>> sqlCmd = "select \
>> freq.docid, \
>> freqTranspose.docid \
>>   from \
>>   frequency as freq, \
>>   (select term, docid, count from frequency)  \
>> "
>> similarities = sqlContext.sql(sqlCmd)
>> 
>> 
>> /Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib/
>> py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer,
>> gateway_client, target_id, name)298 raise Py4JJavaError(
>> 299 'An error occurred while calling
>> {0}{1}{2}.\n'.--> 300 format(target_id, '.', name),
>> value)
>> 301 else:302 raise Py4JError(
>> 
>> Py4JJavaError: An error occurred while calling o40.sql.
>> : java.lang.RuntimeException: [1.153] failure: ``('' expected but `from'
>> found
>> 
>> select freq.docid, freqTranspose.docid   from
>> frequency as freq,   (select term, docid, count from frequency)
>> 
>> 
>> Simple sql seems to ³parse² I.e. Select  freq.docid from frequency as freq
>> 
>> Any suggestions would be greatly appreciated.
>> 
>> Andy
>> 
>> P.s. I should note, I think I am using version 1.0 ?
>> 
>> 
> 




Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Great success!

I was able to get output to the driver console by changing the construction
of the Streaming Spark Context from:

 val ssc = new StreamingContext("local" /**TODO change once a cluster is up
**/,
"AppName", Seconds(1))


to:

val ssc = new StreamingContext("local[2]" /**TODO change once a cluster is
up **/,
"AppName", Seconds(1))


I found something that tipped me off that this might work by digging
through this mailing list.
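
For anyone hitting the same thing, here is a minimal self-contained sketch
(host and port are illustrative) that prints reduced counts to the driver
console. The key point is that "local[2]" leaves at least one core for
processing batches in addition to the one occupied by the socket receiver:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object SocketCounts {
  def main(args: Array[String]) {
    // one thread for the receiver, at least one for processing the batches
    val ssc = new StreamingContext("local[2]", "SocketCounts", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999) // illustrative port
    lines.map(_ -> 1).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}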


On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
wrote:

> More strange behavior:
>
> lines.foreachRDD(x => println(x.first)) // works
> lines.foreachRDD(x => println((x.count,x.first))) // no output is printed
> to driver console
>
>
>
>
> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
> wrote:
>
>>
>> Thanks for your interest.
>>
>> lines.foreachRDD(x => println(x.count))
>>
>> And I got 0 every once in a while (which I think is strange, because
>> lines.print prints the input I'm giving it over the socket.)
>>
>>
>> When I tried:
>>
>> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>>
>> I got no count.
>>
>> Thanks
>>
>>
>> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Try doing DStream.foreachRDD and then printing the RDD count and further
>>> inspecting the RDD.
>>>  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
>>> wrote:
>>>
 Hi,

 I have a DStream that works just fine when I say:

 dstream.print

 If I say:

 dstream.map(_,1).print

 that works, too.  However, if I do the following:

 dstream.reduce{case(x,y) => x}.print

 I don't get anything on my console.  What's going on?

 Thanks

>>>
>>
>


Ideal core count within a single JVM

2014-07-13 Thread lokesh.gidra
Hello,

What would be an ideal core count to run a spark job in local mode to get
best utilization of CPU? Actually I have a 48-core machine but the
performance of local[48] is poor as compared to local[10].


Lokesh





Re: can't print DStream after reduce

2014-07-13 Thread Michael Campbell
This almost had me not using Spark; I couldn't get any output.  It is not
at all obvious what's going on here to the layman (and to the best of my
knowledge, not documented anywhere), but now you know you'll be able to
answer this question for the numerous people that will also have it.


On Sun, Jul 13, 2014 at 5:24 PM, Walrus theCat 
wrote:

> Great success!
>
> I was able to get output to the driver console by changing the
> construction of the Streaming Spark Context from:
>
>  val ssc = new StreamingContext("local" /**TODO change once a cluster is
> up **/,
> "AppName", Seconds(1))
>
>
> to:
>
> val ssc = new StreamingContext("local[2]" /**TODO change once a cluster is
> up **/,
> "AppName", Seconds(1))
>
>
> I found something that tipped me off that this might work by digging
> through this mailing list.
>
>
> On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
> wrote:
>
>> More strange behavior:
>>
>> lines.foreachRDD(x => println(x.first)) // works
>> lines.foreachRDD(x => println((x.count,x.first))) // no output is printed
>> to driver console
>>
>>
>>
>>
>> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
>> wrote:
>>
>>>
>>> Thanks for your interest.
>>>
>>> lines.foreachRDD(x => println(x.count))
>>>
>>>  And I got 0 every once in a while (which I think is strange, because
>>> lines.print prints the input I'm giving it over the socket.)
>>>
>>>
>>> When I tried:
>>>
>>> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>>>
>>> I got no count.
>>>
>>> Thanks
>>>
>>>
>>> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Try doing DStream.foreachRDD and then printing the RDD count and
 further inspecting the RDD.
  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
 wrote:

> Hi,
>
> I have a DStream that works just fine when I say:
>
> dstream.print
>
> If I say:
>
> dstream.map(_,1).print
>
> that works, too.  However, if I do the following:
>
> dstream.reduce{case(x,y) => x}.print
>
> I don't get anything on my console.  What's going on?
>
> Thanks
>

>>>
>>
>


Re: not getting output from socket connection

2014-07-13 Thread Michael Campbell
Make sure you use "local[n]" (where n > 1) in your context setup too (if
you're running locally), or you won't get output.


On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat 
wrote:

> Thanks!
>
> I thought it would get "passed through" netcat, but given your email, I
> was able to follow this tutorial and get it to work:
>
> http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html
>
>
>
>
> On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen  wrote:
>
>> netcat is listening for a connection on port . It is echoing what
>> you type to its console to anything that connects to  and reads.
>> That is what Spark streaming does.
>>
>> If you yourself connect to  and write, nothing happens except that
>> netcat echoes it. This does not cause Spark to somehow get that data.
>> nc is only echoing input from the console.
>>
>> On Fri, Jul 11, 2014 at 9:25 PM, Walrus theCat 
>> wrote:
>> > Hi,
>> >
>> > I have a java application that is outputting a string every second.  I'm
>> > running the wordcount example that comes with Spark 1.0, and running nc
>> -lk
>> > . When I type words into the terminal running netcat, I get counts.
>> > However, when I write the String onto a socket on port , I don't get
>> > counts.  I can see the strings showing up in the netcat terminal, but no
>> > counts from Spark.  If I paste in the string, I get counts.
>> >
>> > Any ideas?
>> >
>> > Thanks
>>
>
>


Re: can't print DStream after reduce

2014-07-13 Thread Sean Owen
How about a PR that rejects a context configured for local or local[1]? As
I understand it, this is not intended to work and has bitten several people.
On Jul 14, 2014 12:24 AM, "Michael Campbell" 
wrote:

> This almost had me not using Spark; I couldn't get any output.  It is not
> at all obvious what's going on here to the layman (and to the best of my
> knowledge, not documented anywhere), but now you know you'll be able to
> answer this question for the numerous people that will also have it.
>
>
> On Sun, Jul 13, 2014 at 5:24 PM, Walrus theCat 
> wrote:
>
>> Great success!
>>
>> I was able to get output to the driver console by changing the
>> construction of the Streaming Spark Context from:
>>
>>  val ssc = new StreamingContext("local" /**TODO change once a cluster is
>> up **/,
>> "AppName", Seconds(1))
>>
>>
>> to:
>>
>> val ssc = new StreamingContext("local[2]" /**TODO change once a cluster
>> is up **/,
>> "AppName", Seconds(1))
>>
>>
>> I found something that tipped me off that this might work by digging
>> through this mailing list.
>>
>>
>> On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
>> wrote:
>>
>>> More strange behavior:
>>>
>>> lines.foreachRDD(x => println(x.first)) // works
>>> lines.foreachRDD(x => println((x.count,x.first))) // no output is
>>> printed to driver console
>>>
>>>
>>>
>>>
>>> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
>>> wrote:
>>>

 Thanks for your interest.

 lines.foreachRDD(x => println(x.count))

  And I got 0 every once in a while (which I think is strange, because
 lines.print prints the input I'm giving it over the socket.)


 When I tried:

 lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))

 I got no count.

 Thanks


 On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Try doing DStream.foreachRDD and then printing the RDD count and
> further inspecting the RDD.
>  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
> wrote:
>
>> Hi,
>>
>> I have a DStream that works just fine when I say:
>>
>> dstream.print
>>
>> If I say:
>>
>> dstream.map(_,1).print
>>
>> that works, too.  However, if I do the following:
>>
>> dstream.reduce{case(x,y) => x}.print
>>
>> I don't get anything on my console.  What's going on?
>>
>> Thanks
>>
>

>>>
>>
>


Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
I actually never got this to work, which is part of the reason why I filed
that JIRA. Apart from using --jars when starting the shell, I don’t have any
more pointers for you. :(


On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski  wrote:

>  Nicholas,
>
> Thanks!
>
> How do I make spark assemble against a local version of Hadoop?
>
> I have 2.4.1 running on a test cluster and I did
> "SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly" but all it did was pull in
> hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1
> HDFS). I am guessing my local version of Hadoop libraries/jars is not used.
> Alternatively, how do I add the hadoop-gpl-compression-0.1.0.jar
> (responsible for the lzo stuff) to this hand assembled Spark?
>
> I am running the spark-shell like this:
> bin/spark-shell --jars
> /home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar
>
> and getting this:
>
> scala> val f = sc.newAPIHadoopFile("hdfs://10.10.0.98:54310/data/1gram.lzo
> ",classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
> 14/07/13 16:53:01 INFO MemoryStore: ensureFreeSpace(216014) called with
> curMem=0, maxMem=311387750
> 14/07/13 16:53:01 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 211.0 KB, free 296.8 MB)
> f: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable,
> org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at
> :12
>
> scala> f.take(1)
> 14/07/13 16:53:08 INFO FileInputFormat: Total input paths to process : 1
> java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.JobContext, but class was expected
> at
> com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:67)
>
> which makes me think something is not linked to something properly (not a
> Java expert unfortunately).
>
> Thanks!
> Ognen
>
>
>
> On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
>
>  If you’re still seeing gibberish, it’s because Spark is not using the
> LZO libraries properly. In your case, I believe you should be calling
> newAPIHadoopFile() instead of textFile().
>
> For example:
>
> sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text])
>
> On a side note, here’s a related JIRA issue: SPARK-2394: Make it easier
> to read LZO-compressed files from EC2 clusters
> 
>
> Nick
>
>
> On Sun, Jul 13, 2014 at 10:49 AM, Ognen Duzlevski <
> ognen.duzlev...@gmail.com> wrote:
>
>> Hello,
>>
>> I have been trying to play with the Google ngram dataset provided by
>> Amazon in form of LZO compressed files.
>>
>> I am having trouble understanding what is going on ;). I have added the
>> compression jar and native library to the underlying Hadoop/HDFS
>> installation, restarted the name node and the datanodes, Spark can
>> obviously see the file but I get gibberish on a read. Any ideas?
>>
>> See output below:
>>
>> 14/07/13 14:39:19 INFO SparkContext: Added JAR
>> file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at
>> http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with
>> timestamp 1405262359777
>> 14/07/13 14:39:20 INFO SparkILoop: Created spark context..
>> Spark context available as sc.
>>
>> scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
>> 14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with
>> curMem=0, maxMem=311387750
>> 14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values to
>> memory (estimated size 160.0 KB, free 296.8 MB)
>> f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
>> :12
>>
>> scala> f.take(10)
>> 14/07/13 14:39:43 INFO SparkContext: Job finished: take at :15,
>> took 0.419708348 s
>> res0: Array[String] =
>> Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec���\> ?�?�?�m??��??hx??�??�???�??�??�??�??�??�?
>> �?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
>> �?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?
>> �?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
>>
>> Thanks!
>> Ognen
>>
>
>
>


Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
In case you still have issues with duplicate files in uber jar, here is a
reference sbt file with assembly plugin that deals with duplicates

https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
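
For reference, the classic sbt-assembly merge-strategy pattern for resolving
duplicate files looks roughly like this (a sketch assuming sbt-assembly 0.11.x
is already added in project/plugins.sbt; the linked build.sbt is the
authoritative example):

import AssemblyKeys._

assemblySettings

// discard conflicting META-INF entries, defer to the default strategy otherwise
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => old(x)
  }
}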


On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay 
wrote:

> You may try to use this one:
>
> https://github.com/sbt/sbt-assembly
>
> I had an issue of duplicate files in the uber jar file. But I think this
> library will assemble dependencies into a single jar file.
>
> Bill
>
>
> On Fri, Jul 11, 2014 at 1:34 AM, Dilip  wrote:
>
>>  A simple
>> sbt assembly
>> is not working. Is there any other way to include particular jars with
>> assembly command?
>>
>> Regards,
>> Dilip
>>
>> On Friday 11 July 2014 12:45 PM, Bill Jay wrote:
>>
>> I have met similar issues. The reason is probably because in Spark
>> assembly, spark-streaming-kafka is not included. Currently, I am using
>> Maven to generate a shaded package with all the dependencies. You may try
>> to use sbt assembly to include the dependencies in your jar file.
>>
>>  Bill
>>
>>
>> On Thu, Jul 10, 2014 at 11:48 PM, Dilip  wrote:
>>
>>>  Hi Akhil,
>>>
>>> Can you please guide me through this? Because the code I am running
>>> already has this in it:
>>> [java]
>>>
>>> SparkContext sc = new SparkContext();
>>>
>>> sc.addJar("/usr/local/spark/external/kafka/target/scala-2.10/spark-streaming-kafka_2.10-1.1.0-SNAPSHOT.jar");
>>>
>>>
>>> Is there something I am missing?
>>>
>>> Thanks,
>>> Dilip
>>>
>>>
>>> On Friday 11 July 2014 12:02 PM, Akhil Das wrote:
>>>
>>>  Easiest fix would be adding the kafka jars to the SparkContext while
>>> creating it.
>>>
>>>  Thanks
>>> Best Regards
>>>
>>>
>>> On Fri, Jul 11, 2014 at 4:39 AM, Dilip  wrote:
>>>
 Hi,

 I am trying to run a program with spark streaming using Kafka on a
 stand alone system. These are my details:

 Spark 1.0.0 hadoop2
 Scala 2.10.3

 I am trying a simple program using my custom sbt project but this is
 the error I am getting:

 Exception in thread "main" java.lang.NoClassDefFoundError:
 kafka/serializer/StringDecoder
 at
 org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:55)
 at
 org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:94)
 at
 org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
 at SimpleJavaApp.main(SimpleJavaApp.java:40)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException:
 kafka.serializer.StringDecoder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 11 more


 here is my .sbt file:

 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.3"

 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

 libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0"

 libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"

 libraryDependencies += "org.apache.spark" %% "spark-examples" % "1.0.0"

 libraryDependencies += "org.apache.spark" %
 "spark-streaming-kafka_2.10" % "1.0.0"

 libraryDependencies += "org.apache.kafka" %% "kafka" % "0.8.0"

 resolvers += "Akka Repository" at "http://repo.akka.io/releases/";

 resolvers += "Maven Repository" at "http://central.maven.org/maven2/";


 sbt package was successful. I also tried sbt "++2.10.3 package" to
 build it for my scala version. Problem remains the same.
 Can anyone help me out here? Ive been stuck on this for quite some time
 now.

 Thank You,
 Dilip

>>>
>>>
>>>
>>
>>
>


Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
Also, the reason spark-streaming-kafka is not included in the Spark
assembly is that we do not want dependencies of external systems like Kafka
(which itself probably has a complex dependency tree) to cause conflicts
with core Spark's functionality and stability.

TD


On Sun, Jul 13, 2014 at 5:48 PM, Tathagata Das 
wrote:

> In case you still have issues with duplicate files in uber jar, here is a
> reference sbt file with assembly plugin that deals with duplicates
>
>
> https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
>
>
> On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay 
> wrote:
>
>> You may try to use this one:
>>
>> https://github.com/sbt/sbt-assembly
>>
>> I had an issue of duplicate files in the uber jar file. But I think this
>> library will assemble dependencies into a single jar file.
>>
>> Bill
>>
>>
>> On Fri, Jul 11, 2014 at 1:34 AM, Dilip  wrote:
>>
>>>  A simple
>>> sbt assembly
>>> is not working. Is there any other way to include particular jars with
>>> assembly command?
>>>
>>> Regards,
>>> Dilip
>>>
>>> On Friday 11 July 2014 12:45 PM, Bill Jay wrote:
>>>
>>> I have met similar issues. The reason is probably because in Spark
>>> assembly, spark-streaming-kafka is not included. Currently, I am using
>>> Maven to generate a shaded package with all the dependencies. You may try
>>> to use sbt assembly to include the dependencies in your jar file.
>>>
>>>  Bill
>>>
>>>
>>> On Thu, Jul 10, 2014 at 11:48 PM, Dilip 
>>> wrote:
>>>
  Hi Akhil,

 Can you please guide me through this? Because the code I am running
 already has this in it:
 [java]

 SparkContext sc = new SparkContext();

 sc.addJar("/usr/local/spark/external/kafka/target/scala-2.10/spark-streaming-kafka_2.10-1.1.0-SNAPSHOT.jar");


 Is there something I am missing?

 Thanks,
 Dilip


 On Friday 11 July 2014 12:02 PM, Akhil Das wrote:

  Easiest fix would be adding the kafka jars to the SparkContext while
 creating it.

  Thanks
 Best Regards


 On Fri, Jul 11, 2014 at 4:39 AM, Dilip 
 wrote:

> Hi,
>
> I am trying to run a program with spark streaming using Kafka on a
> stand alone system. These are my details:
>
> Spark 1.0.0 hadoop2
> Scala 2.10.3
>
> I am trying a simple program using my custom sbt project but this is
> the error I am getting:
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> kafka/serializer/StringDecoder
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:55)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:94)
> at
> org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
> at SimpleJavaApp.main(SimpleJavaApp.java:40)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException:
> kafka.serializer.StringDecoder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 11 more
>
>
> here is my .sbt file:
>
> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion := "2.10.3"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
>
> libraryDependencies += "org.apache.spark" %% "spark-streaming" %
> "1.0.0"
>
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
>
> libraryDependencies += "org.apache.spark" %% "spark-examples" % "1.0.0"
>
> libraryDependencies += "org.apache.spark" %
> "spark-streaming-kafka_2.10" % "1.0.0"
>
> libraryDependencies += "org.apache.kafka" %% "kafka" % "0.8.0"
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>
> resolvers += "Maven Repository" at "http://central.maven.org/maven2/";
>
>
> sbt package was successful. I also tried sbt "++2.10.3 package" to
> build it for my scala version. Problem remains the same.
> Can anyone

Possible bug in ClientBase.scala?

2014-07-13 Thread Ron Gonzalez
Hi,
  I was doing programmatic submission of Spark yarn jobs and I saw code in 
ClientBase.getDefaultYarnApplicationClasspath():

val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
MRJobConfig doesn't have this field so the created launch env is incomplete. 
Workaround is to set yarn.application.classpath with the value from 
YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.

This results in having the spark job hang if the submission config is different 
from the default config. For example, if my resource manager port is 8050 
instead of 8030, then the spark app is not able to register itself and stays in 
ACCEPTED state.

I can easily fix this by changing this to YarnConfiguration instead of 
MRJobConfig but was wondering what the steps are for submitting a fix.
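
A minimal sketch of the workaround described above (assuming the Hadoop 2.x YARN client API; the surrounding Spark submission code is omitted):

import org.apache.hadoop.yarn.conf.YarnConfiguration

// Explicitly populate yarn.application.classpath so the launch env does not depend on
// reflecting DEFAULT_YARN_APPLICATION_CLASSPATH out of MRJobConfig.
val yarnConf = new YarnConfiguration()
yarnConf.setStrings(
  YarnConfiguration.YARN_APPLICATION_CLASSPATH,
  YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH: _*)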

Thanks,
Ron

Sent from my iPhone

Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Nicholas Chammas
On Sun, Jul 13, 2014 at 9:49 PM, Ron Gonzalez  wrote:

> I can easily fix this by changing this to YarnConfiguration instead of
> MRJobConfig but was wondering what the steps are for submitting a fix.
>

Relevant links:

   - https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
   - https://github.com/apache/spark/pulls

Nick


Re: not getting output from socket connection

2014-07-13 Thread Walrus theCat
Hah, thanks for tidying up the paper trail here, but I was the OP (and
solver) of the recent "reduce" thread that ended in this solution.


On Sun, Jul 13, 2014 at 4:26 PM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> Make sure you use "local[n]" (where n > 1) in your context setup too, (if
> you're running locally), or you won't get output.
>
>
> On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat 
> wrote:
>
>> Thanks!
>>
>> I thought it would get "passed through" netcat, but given your email, I
>> was able to follow this tutorial and get it to work:
>>
>>
>> http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html
>>
>>
>>
>>
>> On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen  wrote:
>>
>>> netcat is listening for a connection on port . It is echoing what
>>> you type to its console to anything that connects to  and reads.
>>> That is what Spark streaming does.
>>>
>>> If you yourself connect to  and write, nothing happens except that
>>> netcat echoes it. This does not cause Spark to somehow get that data.
>>> nc is only echoing input from the console.
>>>
>>> On Fri, Jul 11, 2014 at 9:25 PM, Walrus theCat 
>>> wrote:
>>> > Hi,
>>> >
>>> > I have a java application that is outputting a string every second.
>>>  I'm
>>> > running the wordcount example that comes with Spark 1.0, and running
>>> nc -lk
>>> > . When I type words into the terminal running netcat, I get counts.
>>> > However, when I write the String onto a socket on port , I don't
>>> get
>>> > counts.  I can see the strings showing up in the netcat terminal, but
>>> no
>>> > counts from Spark.  If I paste in the string, I get counts.
>>> >
>>> > Any ideas?
>>> >
>>> > Thanks
>>>
>>
>>
>


Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread kytay
Hi Akhil Das

Thanks.

I tried the code, and it works.

There's a problem with my socket code: it is not flushing the content out,
and for the test tool, Hercules, I have to close the socket connection to
"flush" the content out.

I am going to troubleshoot why nc works while my code and the test tool don't.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-socketTextStream-to-receive-anything-tp9382p9576.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Chester Chen
Ron,
Which distribution and Version of Hadoop are you using ?

 I just looked at CDH5 (hadoop-mapreduce-client-core-2.3.0-cdh5.0.0),

MRJobConfig does have the field :

java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

Chester



On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez  wrote:

> Hi,
>   I was doing programmatic submission of Spark yarn jobs and I saw code in
> ClientBase.getDefaultYarnApplicationClasspath():
>
> val field =
> classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
> MRJobConfig doesn't have this field so the created launch env is
> incomplete. Workaround is to set yarn.application.classpath with the value
> from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.
>
> This results in having the spark job hang if the submission config is
> different from the default config. For example, if my resource manager port
> is 8050 instead of 8030, then the spark app is not able to register itself
> and stays in ACCEPTED state.
>
> I can easily fix this by changing this to YarnConfiguration instead of
> MRJobConfig but was wondering what the steps are for submitting a fix.
>
> Thanks,
> Ron
>
> Sent from my iPhone


Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
For example, are LIKE 'string%' queries supported? Trying one on 1.0.1
yields java.lang.ExceptionInInitializerError.

Nick
​


On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas 
wrote:

> Is there a place where we can find an up-to-date list of supported SQL
> syntax in Spark SQL?
>
> Nick
>
>
> --
> View this message in context: Supported SQL syntax in Spark SQL
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Anaconda Spark AMI

2014-07-13 Thread Jeremy Freeman
Hi Ben,

This is great! I just spun up an EC2 cluster and tested basic pyspark  + 
ipython/numpy/scipy functionality, and all seems to be working so far. Will let 
you know if any issues arise.

We do a lot with pyspark + scientific computing, and for EC2 usage I think this 
is a terrific way to get the core libraries installed.

-- Jeremy

On Jul 12, 2014, at 4:25 PM, Benjamin Zaitlen  wrote:

> Hi All,
> 
> Thanks to Jey's help, I have a release AMI candidate for 
> spark-1.0/anaconda-2.0 integration.  It's currently limited to availability 
> in US-EAST: ami-3ecd0c56
> 
> Give it a try if you have some time.  This should just work with spark 1.0:
> 
> ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa  -a ami-3ecd0c56
> 
> If you have suggestions or run into trouble please email,
> 
> --Ben
> 
> PS:  I found that writing a noop map function is a decent way to install pkgs 
> on worker nodes (though most scientific pkgs are pre-installed with anaconda):
> 
> def subprocess_noop(x):
>     import os
>     os.system("/opt/anaconda/bin/conda install h5py")
>     return 1
> 
> install_noop = rdd.map(subprocess_noop)
> install_noop.count()
> 
> 
> On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam  wrote:
> Hi Ben,
> 
> Has the PYSPARK_PYTHON environment variable been set in
> spark/conf/spark-env.sh to the path of the new python binary?
> 
> FYI, there's a /root/copy-dirs script that can be handy when updating
> files on an already-running cluster. You'll want to restart the spark
> cluster for the changes to take effect, as described at
> https://spark.apache.org/docs/latest/ec2-scripts.html
> 
> Hope that helps,
> -Jey
> 
> On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen  wrote:
> > Hi All,
> >
> > I'm a dev at Continuum and we are developing a fair amount of tooling around
> > Spark.  A few days ago someone expressed interest in numpy+pyspark and
> > Anaconda came up as a reasonable solution.
> >
> > I spent a number of hours yesterday trying to rework the base Spark AMI on
> > EC2 but sadly was defeated by a number of errors.
> >
> > Aggregations seemed to choke -- whereas small takes executed as expected
> > (errors are linked to the gist):
> >
> > >>> sc.appName
> > u'PySparkShell'
> > >>> sc._conf.getAll()
> > [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
> > (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
> > (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
> > u'/root/ephemeral-hdfs/conf'), (u'spark.master',
> > u'spark://.compute-1.amazonaws.com:7077')]
> > >>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
> > >>> file.take(2)
> > [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
> > u'']
> >
> > >>> lines = file.filter(lambda x: len(x) > 0)
> > >>> lines.count()
> > VARIOUS ERRORS DISCUSSED BELOW
> >
> > My first thought was that I could simply get away with including anaconda on
> > the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
> > Doing so resulted in some strange py4j errors like the following:
> >
> > Py4JError: An error occurred while calling o17.partitions. Trace:
> > py4j.Py4JException: Method partitions([]) does not exist
> >
> > At some point I also saw:
> > SystemError: Objects/cellobject.c:24: bad argument to internal function
> >
> > which is really strange, possibly the result of a version mismatch?
> >
> > I had another thought of building spark from master on the AMI, leaving the
> > spark directory in place, and removing the spark call from the modules list
> > in spark-ec2 launch script. Unfortunately, this resulted in the following
> > errors:
> >
> > https://gist.github.com/quasiben/da0f4778fbc87d02c088
> >
> > If a spark dev was willing to make some time in the near future, I'm sure
> > she/he and I could sort out these issues and give the Spark community a
> > python distro ready to go for numerical computing.  For instance, I'm not
> > sure how pyspark calls out to launching a python session on a slave?  Is
> > this done as root or as the hadoop user? (i believe i changed /etc/bashrc to
> > point to my anaconda bin directory so it shouldn't really matter.  Is there
> > something special about the py4j zip include in spark dir compared with the
> > py4j in pypi?
> >
> > Thoughts?
> >
> > --Ben
> >
> >
> 



Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Actually, this looks like it's some kind of regression in 1.0.1, perhaps
related to assembly and packaging with spark-ec2. I don’t see this issue
with the same data on a 1.0.0 EC2 cluster.

How can I trace this down for a bug report?

Nick
​


On Sun, Jul 13, 2014 at 11:18 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> For example, are LIKE 'string%' queries supported? Trying one on 1.0.1
> yields java.lang.ExceptionInInitializerError.
>
> Nick
> ​
>
>
> On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas  > wrote:
>
>> Is there a place where we can find an up-to-date list of supported SQL
>> syntax in Spark SQL?
>>
>> Nick
>>
>>
>> --
>> View this message in context: Supported SQL syntax in Spark SQL
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>
>


Catalyst dependency on Spark Core

2014-07-13 Thread Aniket Bhatnagar
As per the recent presentation given in Scala days (
http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was
mentioned that Catalyst is independent of Spark. But on inspecting pom.xml
of sql/catalyst module, it seems it has a dependency on Spark Core. Any
particular reason for the dependency? I would love to use Catalyst outside
Spark.

(reposted as previous email bounced. Sorry if this is a duplicate).


Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Michael Armbrust
Are you sure the code running on the cluster has been updated?  We recently
optimized the execution of LIKE queries that can be evaluated without using
full regular expressions.  So it's possible this error is due to missing
functionality on the executors.

> How can I trace this down for a bug report?
>
If the above doesn't fix it, the following would be helpful:
 - The full stack trace
 - The queryExecution from the SchemaRDD (i.e. println(sql("SELECT
...").queryExecution))
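
For instance, a minimal sketch of gathering that output for the failing LIKE query (the table and column names are illustrative):

// Print the logical and physical plans Spark SQL built for the failing query.
val failing = sqlContext.sql("SELECT name FROM people WHERE name LIKE 'string%'")
println(failing.queryExecution)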


Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread Tobias Pfeiffer
Hi,

I experienced exactly the same problem when using a SparkContext with the
"local[1]" master specification: one thread is reserved for receiving data and
only the remaining threads do the processing, so with just one thread running,
no processing takes place. Once you shut down the connection, the receiver
thread is freed up and used for processing.
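
A minimal sketch of a setup that avoids this (assuming a socket source on localhost:9999; adjust to your own program):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two local threads: one for the socket receiver, the rest for processing.
val conf = new SparkConf().setMaster("local[2]").setAppName("SocketTest")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.socketTextStream("localhost", 9999).print()
ssc.start()
ssc.awaitTermination()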

Any chance you run into the same issue?

Tobias



On Mon, Jul 14, 2014 at 11:45 AM, kytay  wrote:

> Hi Akhil Das
>
> Thanks.
>
> I tried the code, and it works.
>
> There's a problem with my socket code: it is not flushing the content out,
> and for the test tool, Hercules, I have to close the socket connection to
> "flush" the content out.
>
> I am going to troubleshoot why nc works while my code and the test tool don't.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-socketTextStream-to-receive-anything-tp9382p9576.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Hi

I'm trying to read lzo compressed files from S3 using spark. The lzo files
are not indexed. Spark job starts to read the files just fine but after a
while it just hangs. No network throughput. I have to restart the worker
process to get it back up. Any idea what could be causing this. We were
using uncompressed files before and that worked just fine, went with the
compression to reduce S3 storage. 
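
A minimal sketch of the read path (assuming the hadoop-lzo library's LzoTextInputFormat is on the classpath; the bucket and path names are illustrative):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read non-indexed .lzo files from S3; without an index each file is a single split.
val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
    "s3n://my-bucket/logs/*.lzo")
  .map { case (_, text) => text.toString }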

Any help would be appreciated. 

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-S3-LZO-input-worker-stuck-tp9584.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Interestingly, if I don't cache the data it works. However, as I need to
re-use the data to apply different kinds of filtering, this really slows down
the job because it has to read from S3 again and again. 
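
One hedged option (an assumption, not something confirmed in this thread) is to persist with a disk-backed, serialized storage level instead of the default cache(), keeping the re-use without holding everything deserialized in memory:

import org.apache.spark.storage.StorageLevel

// Spill to local disk instead of re-reading from S3 when the data does not fit in memory.
// "lines" stands in for the RDD read from S3.
val reusable = lines.persist(StorageLevel.MEMORY_AND_DISK_SER)
val filteredA = reusable.filter(_.contains("foo"))   // illustrative filters
val filteredB = reusable.filter(_.contains("bar"))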



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-S3-LZO-input-worker-stuck-tp9584p9585.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Are you sure the code running on the cluster has been updated?

I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m
assuming that’s taken care of, at least in theory.

I just spun down the clusters I had up, but I will revisit this tomorrow
and provide the information you requested.

Nick
​


Re: Powered By Spark: Can you please add our org?

2014-07-13 Thread Alex Gaudio
Awesome!  Thanks, Reynold!


On Tue, Jul 8, 2014 at 4:00 PM, Reynold Xin  wrote:

> I added you to the list. Cheers.
>
>
>
> On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio  wrote:
>
>> Hi,
>>
>> Sailthru is also using Spark.  Could you please add us to the Powered By
>> Spark
>>  page
>> when you have a chance?
>>
>> Organization Name: Sailthru
>> URL: www.sailthru.com
>> Short Description: Our data science platform uses Spark to build
>> predictive models and recommendation systems for marketing automation and
>> personalization
>>
>>
>> Thank you!
>> Alex
>>
>
>


Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread kytay
Hi Tobias

I have been using "local[4]" to test.
My problem is likely caused by the tcp host server that I am trying to
emulate. I was trying to emulate the tcp host to send out messages
(although I am not sure at the moment :D).

The first way I tried was to use a tcp tool called Hercules.

The second way was to write a simple socket program that sends a message at
intervals, like the one shown in #2 of my first post. I suspect the reason it
doesn't work is that the messages are not flushed, so no message is received
by Spark Streaming.
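
For what it's worth, a minimal sketch of a test sender that flushes each line (hypothetical code, not the program from the earlier post), which socketTextStream can read without the connection being closed:

import java.io.PrintWriter
import java.net.ServerSocket

object FlushingSender {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    val client = server.accept()                             // wait for the Spark receiver
    val out = new PrintWriter(client.getOutputStream, true)  // autoFlush on println
    var i = 0
    while (true) {
      out.println("message " + i)                            // newline-terminated and flushed
      i += 1
      Thread.sleep(1000)
    }
  }
}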

I think I will need to do more testing to understand the behavior. I am
currently not sure why "nc -lk" is working, and not the other tools or codes
I am testing with.

Regards.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Solved-Streaming-Cannot-get-socketTextStream-to-receive-anything-tp9382p9588.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: KMeans code is rubbish

2014-07-13 Thread Wanda Hawk
The problem is that I get the same results every time


On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar  wrote:
 


Hi Wanda,

As Sean mentioned, K-means is not guaranteed to find an optimal answer, even 
for seemingly simple toy examples. A common heuristic to deal with this issue 
is to run kmeans multiple times and choose the best answer.  You can do this by 
changing the runs parameter from the default value (1) to something larger (say 
10).

-Ameet
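
A minimal sketch of that suggestion against the 1.0-era MLlib API (the input path is illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("2dim2.txt")
  .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
  .cache()

// runs = 10 restarts k-means from 10 different initializations and keeps the best model.
val model = KMeans.train(points, 2, 20, 10, KMeans.RANDOM)
model.clusterCenters.foreach(println)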



On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk  wrote:

I also took a look at 
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
 and ran the code in a shell.
>
>
>There is an issue here:
>"    val initMode = params.initializationMode match {
>      case Random => KMeans.RANDOM
>      case Parallel => KMeans.K_MEANS_PARALLEL
>    }
>"
>
>
>If I use initMode=KMeans.RANDOM everything is ok.
>If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result. I do not know 
>why. The example proposed is a really simple one that should not accept 
>multiple solutions and always converge to the correct one.
>
>
>Now what can be altered in the original SparkKMeans.scala (the seed or 
>something else ?) to get the correct results each and every single time ?
>On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng  wrote:
> 
>
>
>SparkKMeans is a naive implementation. Please use
>mllib.clustering.KMeans in practice. I created a JIRA for this:
>https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
>
>
>On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
> wrote:
>> I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
>> dataset as well, I got the expected answer. And I believe that even though
>> initialization is done using sampling, the example actually sets the seed to
>> a constant 42, so the result should always be the same no matter how many
>> times you run it. So I am not really sure whats going on here.
>>
>> Can you tell us more about which version of Spark you are running? Which
>> Java version?
>>
>>
>> ==
>>
>> [tdas @ Xion spark2] cat input
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>> [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
>> 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info
 from
>> SCDynamicStore
>> 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 14/07/10 02:45:07 WARN LoadSnappy:
 Snappy native library not loaded
>> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
>> com.github.fommil.netlib.NativeSystemBLAS
>> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
>> com.github.fommil.netlib.NativeRefBLAS
>> Finished iteration (delta = 3.0)
>> Finished iteration (delta = 0.0)
>> Final centers:
>> DenseVector(5.0, 2.0)
>> DenseVector(2.0, 2.0)
>>
>>
>>
>> On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk  wrote:
>>>
>>> so this is what I am running:
>>> "./bin/run-example SparkKMeans
 ~/Documents/2dim2.txt 2 0.001"
>>>
>>> And this is the input file:"
>>> ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
>>> └───#!cat ~/Documents/2dim2.txt
>>> 2 1
>>> 1 2
>>> 3 2
>>> 2 3
>>> 4 1
>>> 5 1
>>> 6 1
>>> 4 2
>>> 6 2
>>> 4 3
>>> 5 3
>>> 6 3
>>> "
>>>
>>> This is the final output from spark:
>>> "14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>> Getting 2 non-empty blocks
 out of 2 blocks
>>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>> Started 0 remote fetches in 0 ms
>>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>> maxBytesInFlight: 50331648, targetRequestSize: 10066329
>>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>> Getting 2 non-empty blocks out of 2 blocks
>>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>>> Started 0 remote fetches in 0 ms
>>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
>>> 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
>>> 14/07/10 20:05:12 INFO Executor: Finished task ID 14
>>>
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
>>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
>>> localhost (progress: 1/2)
>>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
>>> 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
>>> 14/07/10 20:05:12 INFO Executor: Finished task ID 15
>>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
>>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
>>> localhost (progress: 2/2)
>>> 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
>>> SparkKMeans.scala:75) finished in 0.008 s
>>> 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
 tasks
>>> have all completed, from pool
>>> 14/07/10 20

mapPartitionsWithIndex

2014-07-13 Thread Madhura
I have a text file consisting of a large number of random floating-point values
separated by spaces. I am loading this file into an RDD in Scala.

I have heard of mapPartitionsWithIndex but I haven't been able to implement
it. For each partition I want to call a method (process in this case) to
which I want to pass the partition and its index as parameters.

My method returns a pair of values.
This is what I have done.

val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
var ind:Int=0
val keyval= dRDD.mapPartitionsWithIndex((ind,x) => process(ind,x,...))
val res=keyval.collect()
 
We are not able to access res(0)._1 and res(0)._2

The error log is as follows.

[error] SimpleApp.scala:420: value trim is not a member of Iterator[String]
[error] Error occurred in an application involving default arguments.
[error] val keyval=dRDD.mapPartitionsWithIndex( (ind,x) =>
process(ind,x.trim().split(' ').map(_.toDouble),q,m,r))
[error] 
^
[error] SimpleApp.scala:425: value mkString is not a member of
Array[Nothing]
[error]   println(res.mkString(""))
[error]   ^
[error] /SimpleApp.scala:427: value _1 is not a member of Nothing
[error]   var final= res(0)._1
[error] ^
[error] /home/madhura/DTWspark/src/main/scala/SimpleApp.scala:428: value _2
is not a member of Nothing
[error]   var final1 = res(0)._2 - m +1
[error]  ^
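
For reference, a minimal sketch of the shape the compiler expects (the process/q/m/r pieces from the post are replaced by an illustrative stand-in): the function passed to mapPartitionsWithIndex receives an Iterator[String], so per-line parsing has to happen inside a map over that iterator, and the function must itself return an iterator.

val dRDD = sc.textFile("hdfs://master:54310/Data/input*")

// Illustrative stand-in for process(...): returns one (index, value) pair per parsed line.
def process(ind: Int, values: Array[Double]): (Int, Double) = (ind, values.sum)

val keyval = dRDD.mapPartitionsWithIndex { (ind, it) =>
  it.map(line => process(ind, line.trim.split(' ').map(_.toDouble)))
}

val res = keyval.collect()          // Array[(Int, Double)], so res(0)._1 and res(0)._2 compile
println(res.mkString(" "))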




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitionsWithIndex-tp9590.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.