Re: Bug in Accumulators...

2014-11-23 Thread Aaron Davidson
As Mohit said, making Main extend Serializable should fix this example. In
general, it's not a bad idea to mark the fields you don't want to serialize
(e.g., sc and conf in this case) as @transient as well, though this is not
the issue in this case.

Note that this problem would not have arisen in your very specific example
if you used a while loop instead of a for-each loop, but that's really more
of a happy coincidence than something you should rely on, as nested lambdas
are virtually unavoidable in Scala.
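Concretely, the suggested pattern looks roughly like this minimal sketch (object, path and accumulator names are illustrative, not taken from the original program):

    import org.apache.spark.{SparkConf, SparkContext}

    object Main extends Serializable {
      // not shipped to executors even if a closure drags Main along
      @transient lazy val conf = new SparkConf().setAppName("accumulator-example")
      @transient lazy val sc = new SparkContext(conf)

      def main(args: Array[String]): Unit = {
        val totalLetters = sc.accumulator(0, "ttl")
        sc.textFile("input.txt").foreach(line => totalLetters += line.length)
        println(totalLetters.value)
      }
    }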

On Sat, Nov 22, 2014 at 5:16 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 perhaps the closure ends up including the main object which is not
 defined as serializable...try making it a case object or object main
 extends Serializable.


 On Sat, Nov 22, 2014 at 4:16 PM, lordjoe lordjoe2...@gmail.com wrote:

 I posted several examples in java at http://lordjoesoftware.blogspot.com/

 Generally code like this works and I show how to accumulate more complex
 values.

 // Make two accumulators using Statistics
  final Accumulator<Integer> totalLetters = ctx.accumulator(0, "ttl");
  JavaRDD<String> lines = ...

 JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
     @Override
     public Iterable<String> call(final String s) throws Exception {
         // Handle accumulator here
         totalLetters.add(s.length()); // count all letters
         return Arrays.asList(s.split(" ")); // (assumed) split the line into words
     }
  });

  Long numberCalls = totalCounts.value(); // totalCounts is the second accumulator (not shown here)

 I believe the mistake is to pass the accumulator to the function rather
 than
 letting the function find the accumulator - I do this in this case by
 using
 a final local variable



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p19579.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Bug in Accumulators...

2014-11-23 Thread Sean Owen
Here, the Main object is not meant to be serialized. transient ought
to be for fields that are within an object that is legitimately
supposed to be serialized, but whose value can be recreated on
deserialization. I feel like marking objects that aren't logically
Serializable as such is a hack, and transient extends that hack, and
will cause surprises later.

Hack away for toy examples but ideally the closure cleaner would snip
whatever phantom reference is at work here. I usually try to rewrite
the Scala as you say to avoid the issue rather than make things
Serializable ad hoc.
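
That rewrite often amounts to something like this sketch (illustrative names, not the original program): keep the accumulator in a local val inside a method, so the closure captures only that val and never the enclosing object.

    import org.apache.spark.SparkContext

    object Main {
      def run(sc: SparkContext): Unit = {
        // local val: the closure below captures only this reference, not Main itself
        val totalLetters = sc.accumulator(0, "ttl")
        sc.textFile("input.txt").foreach(line => totalLetters += line.length)
        println(totalLetters.value)
      }
    }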

On Sun, Nov 23, 2014 at 10:49 AM, Aaron Davidson ilike...@gmail.com wrote:
 As Mohit said, making Main extend Serializable should fix this example. In
 general, it's not a bad idea to mark the fields you don't want to serialize
 (e.g., sc and conf in this case) as @transient as well, though this is not
 the issue in this case.

 Note that this problem would not have arisen in your very specific example
 if you used a while loop instead of a for-each loop, but that's really more
 of a happy coincidence than something you should rely on, as nested lambdas
 are virtually unavoidable in Scala.

 On Sat, Nov 22, 2014 at 5:16 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 perhaps the closure ends up including the main object which is not
 defined as serializable...try making it a case object or object main
 extends Serializable.


 On Sat, Nov 22, 2014 at 4:16 PM, lordjoe lordjoe2...@gmail.com wrote:

 I posted several examples in java at http://lordjoesoftware.blogspot.com/

 Generally code like this works and I show how to accumulate more complex
 values.

 // Make two accumulators using Statistics
  final Accumulator<Integer> totalLetters = ctx.accumulator(0, "ttl");
  JavaRDD<String> lines = ...

 JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
     @Override
     public Iterable<String> call(final String s) throws Exception {
         // Handle accumulator here
         totalLetters.add(s.length()); // count all letters
         return Arrays.asList(s.split(" ")); // (assumed) split the line into words
     }
  });

  Long numberCalls = totalCounts.value(); // totalCounts is the second accumulator (not shown here)

 I believe the mistake is to pass the accumulator to the function rather
 than
 letting the function find the accumulator - I do this in this case by
 using
 a final local variable



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p19579.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark serialization issues with third-party libraries

2014-11-23 Thread jatinpreet
Thanks Sean, I was actually using instances created elsewhere inside my RDD
transformations, which, as I understand it, goes against the Spark programming
model. I was referred to a talk about UIMA and Spark integration from this
year's Spark Summit, which had a workaround for this problem. I just had to
make some class members transient.

http://spark-summit.org/2014/talk/leveraging-uima-in-spark
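
For anyone else hitting this, the workaround boils down to a pattern like the following sketch. The class and field names are invented, and SimpleDateFormat merely stands in for a heavyweight third-party object such as a UIMA engine:

    import java.text.SimpleDateFormat

    // The @transient lazy val is not serialized with the task closure and is
    // re-created on each executor the first time it is used.
    class LineAnalyzer(pattern: String) extends Serializable {
      @transient private lazy val formatter = new SimpleDateFormat(pattern)

      def parseTimestamp(line: String): Long = formatter.parse(line).getTime
    }

An instance built on the driver can then be used inside map()/flatMap() without serialization errors.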

Thanks



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454p19589.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
I am a newbie as well to Spark. Been Hadoop/Hive/Oozie programming extensively 
before this. I use Hadoop (Java MR code)/Hive/Impala/Presto on a daily basis.
To get me jumpstarted into Spark I started this GitHub repo where there is 
IntelliJ-ready-to-run code (simple examples of join, SparkSQL etc.) and I will 
keep adding to that. I don't know Scala and I am learning that too to help me 
use Spark better. https://github.com/sanjaysubramanian/msfx_scala.git

Philosophically speaking it's possibly not a good idea to take an either/or 
approach to technology...Like it's never going to be either RDBMS or NOSQL (if 
the Cassandra behind FB shows 100 likes instead of 1000 on your photo one day 
for some reason you may not be as upset...but if the Oracle/DB2 systems 
behind Wells Fargo show $100 LESS in your account due to a database error, you 
will be PANIC-ing).

So it's the same case with Spark or Hadoop. I can speak for myself. I have a 
usecase for processing old logs that are multiline (i.e. they have a 
[begin_timestamp_logid] and [end_timestamp_logid] and have many lines in 
between). In Java Hadoop I created custom RecordReaders to solve this. I still 
don't know how to do this in Spark. Till that time I am possibly gonna run the 
Hadoop code within Oozie in production. 
Also my current task is evangelizing Big Data at my company. The tech people 
I can educate with Hadoop and Spark and they would learn that, but not the 
business intelligence analysts. They love SQL so I have to educate them using 
Hive, Presto, Impala...so the question is: what is your task or tasks?

Sorry, a long non-technical answer to your question...
Make sense?
sanjay
  From: Krishna Sankar ksanka...@gmail.com
 To: Sean Owen so...@cloudera.com 
Cc: Guillermo Ortiz konstt2...@gmail.com; user user@spark.apache.org 
 Sent: Saturday, November 22, 2014 4:53 PM
 Subject: Re: Spark or MR, Scala or Java?
   
Adding to already interesting answers:

   - Is there any case where MR is better than Spark? I don't know in which
     cases I should use Spark rather than MR. When is MR faster than Spark?

   - Many. MR would be better (am not saying faster ;o)) for:
     - Very large datasets,
     - Multistage map-reduce flows,
     - Complex map-reduce semantics
   - Spark is definitely better for the classic iterative, interactive workloads.
   - Spark is very effective for implementing the concepts of in-memory
     datasets & real-time analytics
     - Take a look at the Lambda architecture
   - Also check out how Ooyala is using Spark in multiple layers &
     configurations. They also have MR in many places.
   - In our case, we found Spark very effective for ELT - we would have used
     MR earlier.

   - I know Java, is it worth it to learn Scala for programming Spark or is
     it okay with just Java?

   - Java will work fine. Especially when Java 8 becomes the norm, we will
     get back some of the elegance.
   - I, personally, like Scala & Python a lot better than Java. Scala is a
     lot more elegant, but compilation, IDE integration et al are still
     clunky.
   - One word of caution - stick with one language as much as possible;
     shuffling between Java & Scala is not fun.

Cheers & HTH
k/


On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote:

MapReduce is simpler and narrower, which also means it is generally lighter 
weight, with less to know and configure, and it runs more predictably. If you 
have a job that is truly just a few maps, with maybe one reduce, MR will likely 
be more efficient. Until recently its shuffle has been more developed and offers 
some semantics the Spark shuffle does not. I suppose it also integrates with 
tools, like Oozie, that Spark does not.

I suggest learning enough Scala to use Spark in Scala. The amount you need to 
know is not large.

(Mahout MR-based implementations do not run on Spark and will not. They have 
been removed instead.)

On Nov 22, 2014 3:36 PM, Guillermo Ortiz konstt2...@gmail.com wrote:

Hello,

I'm a newbie with Spark but I've been working with Hadoop for a while.
I have two questions.

Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark rather than MR. When is MR faster than Spark?

The other question is: I know Java, is it worth it to learn Scala for
programming Spark, or is it okay with just Java? I have done a little
piece of code with Java because I feel more confident with it, but it
seems that I'm missing something.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






  

Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread riginos
Hi guys,
I'm trying to follow the Spark SQL Programming Guide, but after:

case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people =
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p
=> Person(p(0), p(1).trim.toInt)) 

I'm issuing:
people.registerTempTable("people")
 <console>:20: error: value registerTempTable is not a member of
org.apache.spark.rdd.RDD[Person]
  people.registerTempTable("people")

Why is that? What am I doing wrong?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-tp19591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ashish Rangole
This being a very broad topic, a discussion can quickly get subjective.
I'll try not to deviate from my experiences and observations to keep this
thread useful to those looking for answers.

I have used Hadoop MR (with Hive, MR Java apis, Cascading and Scalding) as
well as Spark (since v 0.6) in Scala. I learnt Scala for using Spark. My
observations are below.

Spark and Hadoop MR:
1. There doesn't have to be a dichotomy between Hadoop ecosystem and Spark
since Spark is a part of it.

2. Spark or Hadoop MR, there is no getting away from learning how
partitioning, input splits, and the shuffle process work. In order to optimize
performance, troubleshoot, and design software, one must know these. I
recommend reading the first 6-7 chapters of the Hadoop: The Definitive Guide book
to develop an initial understanding. Indeed, knowing a couple of divide and
conquer algorithms is a pre-requisite and I assume everyone on this mailing
list is very familiar :)

3. Having used a lot of different APIs and layers of abstraction for Hadoop
MR, my experience progressing from the MR Java API -> Cascading -> Scalding
is that each new API looks simpler than the previous one. However, the Spark
API and abstraction has been the simplest, not only for me but for those I have
seen start with Hadoop MR or Spark first. It is easiest to get started and
become productive with Spark, with the exception of Hive for those who are
already familiar with SQL. Spark's ease of use is critical for teams
starting out with Big Data.

4. It is also extremely simple to chain multi-stage jobs in Spark, you do
it without even realizing by operating over RDDs. In Hadoop MR, one has to
handle it explicitly.

5. Spark has built-in support for graph algorithms (including Bulk
Synchronous Parallel processing BSP algorithms e.g. Pregel), Machine
Learning and Stream processing. In Hadoop MR you need a separate
library/Framework for each and it is non-trivial to combine multiple of
these in the same application. This is huge!

6. In Spark one does have to learn how to configure the memory and other
parameters of their cluster. Just to be clear, similar parameters exist in
MR as well (e.g. shuffle memory parameters) but you don't *have* to learn
about tuning them until you have jobs with larger data sizes. In Spark
you learn this by reading the configuration and tuning documentation
followed by experimentation. This is an area of Spark where things can be
better.

Java or Scala : I knew Java already yet I learnt Scala when I came across
Spark. As others have said, you can get started with a little bit of Scala
and learn more as you progress. Once you have started using Scala for a few
weeks you would want to stay with it instead of going back to Java. Scala
is arguably more elegant and less verbose than Java which translates into
higher developer productivity and more maintainable code.

Myth: Spark is for in-memory processing *only*. This is a common beginner
misunderstanding.

Sanjay: Spark uses Hadoop API for performing I/O from file systems (local,
HDFS, S3 etc). Therefore you can use the same Hadoop InputFormat and
RecordReader with Spark that you use with Hadoop for your multi-line record
format. See SparkContext APIs. Just like Hadoop, you will need to make sure
that your files are split at record boundaries.
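
As a rough sketch of what that looks like (assuming the spark-shell's sc; TextInputFormat is only a placeholder for the custom multi-line InputFormat/RecordReader class):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // the custom multi-line InputFormat would replace TextInputFormat here
    val records = sc.newAPIHadoopFile(
      "hdfs:///logs/",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map { case (_, value) => value.toString }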

Hope this is helpful.


On Sun, Nov 23, 2014 at 8:35 AM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com.invalid wrote:

 I am a newbie as well to Spark. Been Hadoop/Hive/Oozie programming
 extensively before this. I use Hadoop(Java MR code)/Hive/Impala/Presto on
 a daily basis.

 To get me jumpstarted into Spark I started this gitHub where there is
 IntelliJ-ready-To-run code (simple examples of jon, sparksql etc) and I
 will keep adding to that. I dont know scala and I am learning that too to
 help me use Spark better.
 https://github.com/sanjaysubramanian/msfx_scala.git

 Philosophically speaking its possibly not a good idea to take an either/or
 approach to technology...Like its never going to be either RDBMS or NOSQL
 (If the Cassandra behind FB shows 100 fewer likes instead of 1000 on you
 Photo a day for some reason u may not be as upset...but if the Oracle/Db2
 systems behind Wells Fargo show $100 LESS in your account due to an
 database error, you will be PANIC-ing).


 So its the same case with Spark or Hadoop. I can speak for myself. I have
 a usecase for processing old logs that are multiline (i.e. they have a
 [begin_timestamp_logid] and [end_timestamp_logid] and have many lines in
  between. In Java Hadoop I created custom RecordReaders to solve this. I
 still dont know how to do this in Spark. Till that time I am possibly gonna
 run the Hadoop code within Oozie in production.

 Also my current task is evangelizing Big Data at my company. So the tech
 people I can educate with Hadoop and Spark and they would learn that but
 not the business intelligence analysts. They love SQL so I have to educate
 them using Hive , Presto, Impala...so the question is 

Spark Streaming with Python

2014-11-23 Thread Venkat, Ankam
I am trying to run network_wordcount.py example mentioned at

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py

on CDH5.2 Quickstart VM.   Getting below error.

Traceback (most recent call last):
  File "/usr/lib/spark/examples/lib/network_wordcount.py", line 4, in <module>
    from pyspark.streaming import StreamingContext
ImportError: No module named streaming

How to resolve this?

Regards,
Venkat


Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
By any chance are you using Spark 1.0.2? registerTempTable was introduced
in Spark 1.1+; for Spark 1.0.2, it would be registerAsTable.
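
For reference, a minimal sketch of the 1.1+ flow from the guide (assuming the spark-shell's sc; paths as in the guide):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicitly turns RDD[Person] into a SchemaRDD

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerTempTable("people")          // on 1.0.x: people.registerAsTable("people")
    val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")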

On Sun Nov 23 2014 at 10:59:48 AM riginos samarasrigi...@gmail.com wrote:

 Hi guys,
 I'm trying to follow the Spark SQL Programming Guide, but after:

 case class Person(name: String, age: Int)
 // Create an RDD of Person objects and register it as a table.
 val people =
 sc.textFile("examples/src/main/resources/people.txt").
 map(_.split(",")).map(p
 => Person(p(0), p(1).trim.toInt))

 I'm issuing:
 people.registerTempTable("people")
  <console>:20: error: value registerTempTable is not a member of
 org.apache.spark.rdd.RDD[Person]
   people.registerTempTable("people")

 Why is that? What am I doing wrong?



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-
 tp19591.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread riginos
That was the problem! Thank you Denny for your fast response!
Another quick question:
Is there any way to update Spark to 1.1.0 fast?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-tp19591p19595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Python Logistic Regression error

2014-11-23 Thread Venkat, Ankam
Can you please suggest sample data for running the logistic_regression.py?

I am trying to use a sample data file at  
https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt

I am running this on CDH5.2 Quickstart VM.

[cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3

But, getting below error.

14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; 
aborting job
14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool
14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0
14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296
Traceback (most recent call last):
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 50, in <module>
    model = LogisticRegressionWithSGD.train(points, iterations)
  File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 110, in train
    initialWeights)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 430, in _regression_train_wrapper
    initial_weights = _get_initial_weights(initial_weights, data)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 415, in _get_initial_weights
    initial_weights = _convert_vector(data.first().features)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1127, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1109, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3, 192.168.139.145): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1105, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
    values = [float(s) for s in line.split(' ')]
ValueError: invalid literal for float(): 1:0.4551273600657362

Regards,
Venkat


Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ognen Duzlevski
On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com wrote:

 Java or Scala : I knew Java already yet I learnt Scala when I came across
 Spark. As others have said, you can get started with a little bit of Scala
 and learn more as you progress. Once you have started using Scala for a few
 weeks you would want to stay with it instead of going back to Java. Scala
 is arguably more elegant and less verbose than Java which translates into
 higher developer productivity and more maintainable code.


Scala is arguably more elegant and less verbose than Java. However, Scala
is also a complex language with a lot of details and tidbits and one-offs
that you just have to remember.  It is sometimes difficult to make a
decision whether what you wrote is using the language features most
effectively or if you missed out on an available feature that could have
made the code better or more concise. For Spark you really do not need to
know that much Scala but you do need to understand the essence of it.

Thanks for the good discussion! :-)
Ognen


Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
It sort of depends on your environment. If you are running in your local
environment, I would just download the latest Spark 1.1 binaries and you'll
be good to go. If it's a production environment, it sort of depends on how
you are set up (e.g. AWS, Cloudera, etc.).

On Sun Nov 23 2014 at 11:27:49 AM riginos samarasrigi...@gmail.com wrote:

 That was the problem ! Thank you Denny for your fast response!
 Another quick question:
 Is there any way to update spark to 1.1.0 fast?




 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-
 tp19591p19595.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Converting a column to a map

2014-11-23 Thread Daniel Haviv
Hi,
I have a column in my schemaRDD that is a map, but I'm unable to convert it
to a Scala Map. I've tried converting it to a Tuple2[String,String]:
val converted = jsonFiles.map(line => {
line(10).asInstanceOf[Tuple2[String,String]]})

but I get a ClassCastException:
14/11/23 11:51:30 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0
(TID 2, localhost): java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to
scala.Tuple2

And if I convert it to Iterable[String] I can only get the values without
the keys.

What is the correct data type I should convert it to?

Thanks,
Daniel


Creating a front-end for output from Spark/PySpark

2014-11-23 Thread Alaa Ali
Hello. Okay, so I'm working on a project to run analytic processing using
Spark or PySpark. Right now, I connect to the shell and execute my
commands. The very first part of my commands is: create an SQL JDBC
connection and cursor to pull from Apache Phoenix, do some processing on
the returned data, and spit out some output. I want to create a web gui
tool kind of a thing where I play around with what SQL query is executed
for my analysis.

I know that I can write my whole Spark program and use spark-submit and
have it accept an argument for the SQL query I want to execute, but this
means that every time I submit: an SQL connection will be created, query
ran, processing done, output printed, program closes and SQL connection
closes, and then the whole thing repeats if I want to do another query
right away. That will probably cause it to be very slow. Is there a way
where I can somehow have the SQL connection working in the backend for
example, and then all I have to do is supply a query from my GUI tool where
it then takes it, runs it, displays the output? I just want to know the big
picture and a broad overview of how would I go about doing this and what
additional technology to use and I'll dig up the rest.

Regards,
Alaa Ali


Re: Error when Spark streaming consumes from Kafka

2014-11-23 Thread Bill Jay
Hi Dibyendu,

Thank you for your answer. I will try the Spark-Kafka consumer.

Bill

On Sat, Nov 22, 2014 at 9:15 PM, Dibyendu Bhattacharya 
dibyendu.bhattach...@gmail.com wrote:

 I believe this has something to do with how the Kafka High Level API manages
 consumers within a Consumer group and how it re-balances during failure. You
 can find some mention of it in this Kafka wiki.

 https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design

 Due to various issues in Kafka High Level APIs, Kafka is moving the High
 Level Consumer API to a complete new set of API in Kafka 0.9.

 Other than this co-ordination issue, High Level consumer also has data
 loss issues.

 You can probably try this Spark-Kafka consumer, which uses the Low Level Simple
 Consumer API, which is more performant and has no data loss scenarios.

 https://github.com/dibbhatt/kafka-spark-consumer

 Regards,
 Dibyendu

 On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I am using Spark to consume from Kafka. However, after the job has run
 for several hours, I saw the following failure of an executor:

 kafka.common.ConsumerRebalanceFailedException: 
 group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 
 can't rebalance after 4 retries
 
 kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
 
 kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
 
 kafka.consumer.ZookeeperConsumerConnector.consume(ZookeeperConsumerConnector.scala:212)
 
 kafka.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:138)
 
 org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:114)
 
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
 
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
 
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
 
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)


 Does anyone know the reason for this exception? Thanks!

 Bill





Re: Creating a front-end for output from Spark/PySpark

2014-11-23 Thread Alex Kamil
Alaa,

one option is to use Spark as a cache, importing a subset of data from
HBase/Phoenix that fits in memory, and using JdbcRDD to get more data on a
cache miss. The front end can be created with PySpark and Flask, either as a
REST API translating JSON requests to the SparkSQL dialect, or simply allowing
the user to post SparkSQL queries directly.
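
A rough Scala sketch of the caching half of that idea (the Phoenix JDBC URL, query, bounds and column types are all made up; the PySpark/Flask front end would sit on top of something like this):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // pull the hot subset once, cache it, and serve queries from memory
    val phoenixUrl = "jdbc:phoenix:zk-host:2181"   // hypothetical connection string

    val hotSubset = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(phoenixUrl),
      "SELECT ID, METRIC FROM EVENTS WHERE ID >= ? AND ID <= ?",
      1L, 1000000L, 10,
      (rs: ResultSet) => (rs.getLong(1), rs.getDouble(2))
    ).cache()

    hotSubset.count()   // materialize the cache up front

On a cache miss, the same JdbcRDD pattern can be reused with different bounds to fetch the missing rows.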

On Sun, Nov 23, 2014 at 3:37 PM, Alaa Ali contact.a...@gmail.com wrote:

 Hello. Okay, so I'm working on a project to run analytic processing using
 Spark or PySpark. Right now, I connect to the shell and execute my
 commands. The very first part of my commands is: create an SQL JDBC
 connection and cursor to pull from Apache Phoenix, do some processing on
 the returned data, and spit out some output. I want to create a web gui
 tool kind of a thing where I play around with what SQL query is executed
 for my analysis.

 I know that I can write my whole Spark program and use spark-submit and
 have it accept and argument to be the SQL query I want to execute, but this
 means that every time I submit: an SQL connection will be created, query
 ran, processing done, output printed, program closes and SQL connection
 closes, and then the whole thing repeats if I want to do another query
 right away. That will probably cause it to be very slow. Is there a way
 where I can somehow have the SQL connection working in the backend for
 example, and then all I have to do is supply a query from my GUI tool where
 it then takes it, runs it, displays the output? I just want to know the big
 picture and a broad overview of how would I go about doing this and what
 additional technology to use and I'll dig up the rest.

 Regards,
 Alaa Ali



How to insert complex types like map<string,map<string,int>> in spark sql

2014-11-23 Thread critikaled
Hi,
I am trying to insert a particular set of data from an RDD into a Hive table. I
have a Map[String,Map[String,Int]] in Scala which I want to insert into a
table column of type map<string,map<string,int>>. I was able to create the table, but
while inserting it says scala.MatchError:
MapType(StringType,MapType(StringType,IntegerType,true),true) (of class
org.apache.spark.sql.catalyst.types.MapType). Can anyone help me with this?
Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-string-map-string-int-in-spark-sql-tp19603.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to keep a local variable in each cluster?

2014-11-23 Thread zh8788
Hi,

I am new to Spark. This is the first time I am posting here. Currently, I am
trying to implement ADMM optimization algorithms for Lasso/SVM.
Then I came across a problem:

Since the training data (label, feature) is large, I created an RDD and
cached the training data (label, feature) in memory.  Then for ADMM, it
needs to keep local parameters (u, v) (which are different for each
partition). For each iteration, I need to use the training data (only on
that partition), u, v to calculate the new values for u and v. 

Question1:

One way is to zip (training data, u, v) into an rdd and update it in each
iteration, but as we can see, the training data is large and won't change for
the whole time; only u, v (which are small) are changed in each iteration. If I
zip these three, I could not cache that rdd (since it changes every
iteration). But if I did not cache it, I would need to reuse the training data
every iteration - how could I do it?

Question2:

Related to Question 1: the online documents say that if we don't cache the
rdd, it will not be in memory. And since rdds use delayed (lazy) operations, I am
confused about when I can view a previous rdd as being in memory.

Case1:

B = A.map(function1).
B.collect()#This forces B to be calculated ? After that, the node just
release B since it is not cached ???   
D = B.map(function3) 
D.collect()

Case2:
B = A.map(function1).
D = B.map(function3)   
D.collect()

Case3:

B = A.map(function1).
C = A.map(function2)
D = B.map(function3) 
D.collect()
 
In which case can I assume B is in memory on each node when I calculate
D?

Question3:

Can I use a function to do operations on two rdds? 

E.g.   Function newfun(rdd1, rdd2)  
#rdd1 is large and does not change for the whole time (training data), so I
can cache it
#rdd2 is small and changes in each iteration (u, v)


Questions4:

Or are there other ways to solve this kind of problem? I think this is a
common problem, but I could not find any good solutions.


Thanks a lot

Han 











--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Execute Spark programs from local machine on Yarn-hadoop cluster

2014-11-23 Thread Matt Narrell
I think this IS possible?

You must set the HADOOP_CONF_DIR variable on the machine you’re running the 
Java process that creates the SparkContext.  The Hadoop configuration specifies 
the YARN ResourceManager IPs, and Spark will use that configuration.  
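
A minimal sketch of what that looks like from the local JVM (assuming HADOOP_CONF_DIR or YARN_CONF_DIR on the local machine points at copies of the cluster's *-site.xml files; names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // the ResourceManager address comes from the local copies of the cluster's
    // core-site.xml / yarn-site.xml; yarn-client keeps the driver local while
    // executors run on the cluster
    val conf = new SparkConf()
      .setAppName("NewJob")
      .setMaster("yarn-client")
    val sc = new SparkContext(conf)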

mn

 On Nov 21, 2014, at 8:10 AM, Prannoy pran...@sigmoidanalytics.com wrote:
 
 Hi naveen, 
 
 I don't think this is possible. If you are setting the master with your 
 cluster details you cannot execute any job from your local machine. You have 
 to execute the jobs inside your YARN machine so that SparkConf is able to 
 connect with all the provided details. 
 
 If this is not the case, then give a detailed explanation of what exactly you 
 are trying to do :)
 
 Thanks.
 
 On Fri, Nov 21, 2014 at 8:11 PM, Naveen Kumar Pokala [via Apache Spark User 
 List] wrote:
 Hi,
 
  
 
 I am executing my spark jobs on yarn cluster by forming conf object in the 
 following way.
 
  
 
 SparkConf conf = new 
 SparkConf().setAppName(NewJob).setMaster(yarn-cluster);
 
  
 
 Now I want to execute spark jobs from my local machine how to do that.
 
  
 
 What I mean is there a way to give IP address, port all the details to 
 connect a master(YARN) on some other network from my local spark Program.
 
  
 
 -Naveen
 
 
 
 
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Execute-Spark-programs-from-local-machine-on-Yarn-hadoop-cluster-tp19482p19484.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



wholeTextFiles on 20 nodes

2014-11-23 Thread Simon Hafner
I have 20 nodes via EC2 and an application that reads the data via
wholeTextFiles. I've tried to copy the data into hadoop via
copyFromLocal, and I get

14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in
createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad
connect ack with firstBadLink as X:50010
14/11/24 02:00:07 INFO hdfs.DFSClient: Abandoning block
blk_-8725559184260876712_2627
14/11/24 02:00:07 INFO hdfs.DFSClient: Excluding datanode X:50010

a lot. Then I went the file system route via copy-dir, which worked
well. Now everything is under /root/txt on all nodes. I submitted the
job with the file:///root/txt/ directory for wholeTextFiles() and I
get

Exception in thread "main" java.io.FileNotFoundException: File does
not exist: /root/txt/3521.txt

The file exists on the root node and should be everywhere according to
copy-dir. The hadoop variant worked fine with 3 nodes, but it starts
bugging with 20. I added

  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>

to hdfs-site.xml and core-site.xml, didn't help.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SparkSQL Timestamp query failure

2014-11-23 Thread Wang, Daoyuan
Hi,

I think you can try
 cast(l.timestamp as string)='2012-10-08 16:10:36.0'

Thanks,
Daoyuan

-Original Message-
From: whitebread [mailto:ale.panebia...@me.com] 
Sent: Sunday, November 23, 2014 12:11 AM
To: u...@spark.incubator.apache.org
Subject: Re: SparkSQL Timestamp query failure

Thanks for your answer Akhil, 

I have already tried that and the query actually doesn't fail but it doesn't 
return anything either as it should.
Using single quotes I think it reads it as a string and not as a timestamp. 

I don't know how to solve this. Any other hint by any chance?

Thanks,

Alessandro



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19554.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Lots of small input files

2014-11-23 Thread Shixiong Zhu
We encountered a similar problem. If all partitions are located on the same
node and all of the tasks run in less than 3 seconds (set by
spark.locality.wait, whose default value is 3000 ms), the tasks will all run on
that single node. Our solution is
using org.apache.hadoop.mapred.lib.CombineTextInputFormat to create
big enough tasks. Of course, you can reduce `spark.locality.wait`, but it
may not be efficient because it still creates many tiny tasks.
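
A sketch of what that looks like (the path and split size are illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.lib.CombineTextInputFormat

    // cap the combined split size so everything doesn't collapse into one split
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

    // the combine format packs many small files into fewer, larger splits/tasks
    val lines = sc.hadoopFile(
      "hdfs:///data/many-small-files/",
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map { case (_, text) => text.toString }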

Best Regards,
Shixiong Zhu

2014-11-22 17:17 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com:

 What is your cluster setup? Are you running a worker on the master node
 also?

 1. Spark usually assigns the task to the worker which has the data locally
 available. If one worker has enough tasks, then I believe it will start
 assigning to others as well. You could control it with the level of
 parallelism and all.

 2. If you coalesce it into one partition, then I believe only one of the
 workers will execute the single task.

 Thanks
 Best Regards

 On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

 I have a job that searches for input recursively and creates a string of
 pathnames to treat as one input.

 The files are part-x files and they are fairly small. The job seems
 to take a long time to complete considering the size of the total data
 (150m) and only runs on the master machine. The job only does rdd.map type
 operations.

 1) Why doesn’t it use the other workers in the cluster?
 2) Is there a downside to using a lot of small part files? Should I
 coalesce them into one input file?
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Question about resource sharing in Spark Standalone

2014-11-23 Thread Patrick Liu
Dear all,


Currently, I am running spark standalone cluster with ~100 nodes.
Multiple users can connect to the cluster by Spark-shell or PyShell.
However, I can't find an efficient way to control the resources among multiple 
users.


I can set spark.deploy.defaultCores in the server side to limit the cpus for 
each Application.
But I cannot limit the memory usage by each Application.
User can set spark.executor.memory 10g spark.python.worker.memory 10g in 
./conf/spark-defaults.conf in client side. 
Which means they can control how many resources they have.


How can I control resources in the server side?


Meanwhile, the fair scheduler requires the user to set the pool explicitly. What if 
the user doesn't set it?
Can I maintain the user-to-pool mapping in the server side?


Thanks!

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might not be as important, as the Spark framework mediates the
distributed computation. Even if there is some declarative part in Spark,
we can still choose an inefficient computation path that is not apparent to
the framework.
Cheers
k/
P.S: Now Reply to ALL

On Sun, Nov 23, 2014 at 11:44 AM, Ognen Duzlevski ognen.duzlev...@gmail.com
 wrote:

 On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com
 wrote:

 Java or Scala : I knew Java already yet I learnt Scala when I came across
 Spark. As others have said, you can get started with a little bit of Scala
 and learn more as you progress. Once you have started using Scala for a few
 weeks you would want to stay with it instead of going back to Java. Scala
 is arguably more elegant and less verbose than Java which translates into
 higher developer productivity and more maintainable code.


 Scala is arguably more elegant and less verbose than Java. However, Scala
 is also a complex language with a lot of details and tidbits and one-offs
 that you just have to remember.  It is sometimes difficult to make a
 decision whether what you wrote is the using the language features most
 effectively or if you missed out on an available feature that could have
 made the code better or more concise. For Spark you really do not need to
 know that much Scala but you do need to understand the essence of it.

 Thanks for the good discussion! :-)
 Ognen



Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Cheers
k/
P.S: Now reply to ALL.

On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote:

 Good point.
 On the positive side, whether we choose the most efficient mechanism in
 Scala might not be as important, as the Spark framework mediates the
 distributed computation. Even if there is some declarative part in Spark,
 we can still choose an inefficient computation path that is not apparent to
 the framework.
 Cheers
 k/
 P.S: Now Reply to ALL

 On Sun, Nov 23, 2014 at 11:44 AM, Ognen Duzlevski 
 ognen.duzlev...@gmail.com wrote:

 On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com
 wrote:

 Java or Scala : I knew Java already yet I learnt Scala when I came
 across Spark. As others have said, you can get started with a little bit of
 Scala and learn more as you progress. Once you have started using Scala for
 a few weeks you would want to stay with it instead of going back to Java.
 Scala is arguably more elegant and less verbose than Java which translates
 into higher developer productivity and more maintainable code.


 Scala is arguably more elegant and less verbose than Java. However, Scala
 is also a complex language with a lot of details and tidbits and one-offs
 that you just have to remember.  It is sometimes difficult to make a
 decision whether what you wrote is the using the language features most
 effectively or if you missed out on an available feature that could have
 made the code better or more concise. For Spark you really do not need to
 know that much Scala but you do need to understand the essence of it.

 Thanks for the good discussion! :-)
 Ognen





Re: SparkSQL Timestamp query failure

2014-11-23 Thread Alessandro Panebianco
Hey Daoyuan, 

following your suggestion I obtain the same result as when I do:

where l.timestamp = '2012-10-08 16:10:36.0’

what happens using either your suggestion or simply using single quotes as I 
just typed in the example before is that the query does not fail but it doesn’t 
return anything either as it should.

If I do a simple :

SELECT timestamp FROM Logs limit 5).collect.foreach(println) 

I get: 

[2012-10-08 16:10:36.0]
[2012-10-08 16:10:36.0]
[2012-10-08 16:10:36.0]
[2012-10-08 16:10:41.0]
[2012-10-08 16:10:41.0]

that is why I am sure that putting one of those timestamps should not return an 
empty arrray.

I'd really love to find a solution to this problem. Since Spark supports 
Timestamp, it should provide simple comparison actions with them in my opinion.

Any other help would be greatly appreciated.

Alessandro


 

 On Nov 23, 2014, at 8:10 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:
 
 Hi,
 
 I think you can try
 cast(l.timestamp as string)='2012-10-08 16:10:36.0'
 
 Thanks,
 Daoyuan
 
 -Original Message-
 From: whitebread [mailto:ale.panebia...@me.com] 
 Sent: Sunday, November 23, 2014 12:11 AM
 To: u...@spark.incubator.apache.org
 Subject: Re: SparkSQL Timestamp query failure
 
 Thanks for your answer Akhil, 
 
 I have already tried that and the query actually doesn't fail but it doesn't 
 return anything either as it should.
 Using single quotes I think it reads it as a string and not as a timestamp. 
 
 I don't know how to solve this. Any other hint by any chance?
 
 Thanks,
 
 Alessandro
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19554.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
 commands, e-mail: user-h...@spark.apache.org
 



RE: SparkSQL Timestamp query failure

2014-11-23 Thread Cheng, Hao
Can you try query like “SELECT timestamp, CAST(timestamp as string) FROM logs 
LIMIT 5”, I guess you probably ran into the timestamp precision or the timezone 
shifting problem.

(And it’s not mandatory, but you’d better change the field name from 
“timestamp” to something else, as “timestamp” is the keyword of data type in 
Hive/Spark SQL.)
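
(For reference, a small sketch of that diagnostic in the shell; the table and column names are assumed, with the column renamed to "ts" per the note above:

    sqlContext.sql("SELECT ts, CAST(ts AS STRING) FROM logs LIMIT 5")
      .collect()
      .foreach(println)
)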

From: Alessandro Panebianco [mailto:ale.panebia...@me.com]
Sent: Monday, November 24, 2014 11:12 AM
To: Wang, Daoyuan
Cc: u...@spark.incubator.apache.org
Subject: Re: SparkSQL Timestamp query failure

Hey Daoyuan,

following your suggestion I obtain the same result as when I do:

where l.timestamp = '2012-10-08 16:10:36.0’

what happens using either your suggestion or simply using single quotes as I 
just typed in the example before is that the query does not fail but it doesn’t 
return anything either as it should.

If I do a simple :

SELECT timestamp FROM Logs limit 5).collect.foreach(println)

I get:

[2012-10-08 16:10:36.0]
[2012-10-08 16:10:36.0]
[2012-10-08 16:10:36.0]
[2012-10-08 16:10:41.0]
[2012-10-08 16:10:41.0]

that is why I am sure that putting one of those timestamps should not return an 
empty arrray.

Id really love to find a solution to this problem. Since Spark supports 
Timestamp it should provide simple comparison actions with them in my opinion.

Any other help would be greatly appreciated.

Alessandro




On Nov 23, 2014, at 8:10 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:

Hi,

I think you can try
cast(l.timestamp as string)='2012-10-08 16:10:36.0'

Thanks,
Daoyuan

-Original Message-
From: whitebread [mailto:ale.panebia...@me.com]
Sent: Sunday, November 23, 2014 12:11 AM
To: u...@spark.incubator.apache.org
Subject: Re: SparkSQL Timestamp query failure

Thanks for your answer Akhil,

I have already tried that and the query actually doesn't fail but it doesn't 
return anything either as it should.
Using single quotes I think it reads it as a string and not as a timestamp.

I don't know how to solve this. Any other hint by any chance?

Thanks,

Alessandro



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19554.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



2 spark streaming questions

2014-11-23 Thread tian zhang

Hi, Dear Spark Streaming Developers and Users,
We are prototyping using Spark Streaming and hit the following 2 issues that
I would like to seek your expertise on.

1) We have a Spark Streaming application in Scala that reads data from Kafka
into a DStream, does some processing and outputs a transformed DStream. If for
some reason the Kafka connection is not available or times out, the Spark
Streaming job will start to send empty RDDs afterwards. The log is clean w/o
any ERROR indicator. I googled around and this seems to be a known issue. We
believe that the Spark Streaming infrastructure should either retry or return
an error/exception. Can you share how you handle this case?

2) We would like to implement a Spark Streaming job that joins a 1-minute
duration DStream of real-time events with a metadata RDD that was read from a
database. The metadata only changes slightly each day in the database. So what
is the best practice for refreshing the RDD daily while keeping the streaming
join job running? Is this do-able as of Spark 1.1.0?

Thanks.
Tian



Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
Thanks a ton Ashish

sanjay
  From: Ashish Rangole arang...@gmail.com
 To: Sanjay Subramanian sanjaysubraman...@yahoo.com 
Cc: Krishna Sankar ksanka...@gmail.com; Sean Owen so...@cloudera.com; 
Guillermo Ortiz konstt2...@gmail.com; user user@spark.apache.org 
 Sent: Sunday, November 23, 2014 11:03 AM
 Subject: Re: Spark or MR, Scala or Java?
   
This being a very broad topic, a discussion can quickly get subjective. I'll 
try not to deviate from my experiences and observations to keep this thread 
useful to those looking for answers.
I have used Hadoop MR (with Hive, MR Java apis, Cascading and Scalding) as well 
as Spark (since v 0.6) in Scala. I learnt Scala for using Spark. My 
observations are below.
Spark and Hadoop MR:1. There doesn't have to be a dichotomy between Hadoop 
ecosystem and Spark since Spark is a part of it.
2. Spark or Hadoop MR, there is no getting away from learning how partitioning, 
input splits, and shuffle process work. In order to optimize performance, 
troubleshoot and design software one must know these. I recommend reading first 
6-7 chapters of Hadoop The definitive Guide book to develop initial 
understanding. Indeed knowing a couple of divide and conquer algorithms is a 
pre-requisite and I assume everyone on this mailing list is very familiar :)
3. Having used a lot of different APIs and layers of abstraction for Hadoop MR, 
my experience progressing from MR Java API -- Cascading -- Scalding is that 
each new API looks simpler than the previous one. However, Spark API and 
abstraction has been simplest. Not only for me but those who I have seen start 
with Hadoop MR or Spark first. It is easiest to get started and become 
productive with Spark with the exception of Hive for those who are already 
familiar with SQL. Spark's ease of use is critical for teams starting out with 
Big Data.
4. It is also extremely simple to chain multi-stage jobs in Spark, you do it 
without even realizing by operating over RDDs. In Hadoop MR, one has to handle 
it explicitly.
5. Spark has built-in support for graph algorithms (including Bulk Synchronous 
Parallel processing BSP algorithms e.g. Pregel), Machine Learning and Stream 
processing. In Hadoop MR you need a separate library/Framework for each and it 
is non-trivial to combine multiple of these in the same application. This is 
huge!
6. In Spark one does have to learn how to configure the memory and other 
parameters of their cluster. Just to be clear, similar parameters exist in MR 
as well (e.g. shuffle memory parameters) but you don't *have* to learn about 
tuning them until you have jobs with larger data size jobs. In Spark you learn 
this by reading the configuration and tuning documentation followed by 
experimentation. This is an area of Spark where things can be better.
Java or Scala : I knew Java already yet I learnt Scala when I came across 
Spark. As others have said, you can get started with a little bit of Scala and 
learn more as you progress. Once you have started using Scala for a few weeks 
you would want to stay with it instead of going back to Java. Scala is arguably 
more elegant and less verbose than Java which translates into higher developer 
productivity and more maintainable code.
Myth: Spark is for in-memory processing *only*. This is a common beginner 
misunderstanding.
Sanjay: Spark uses Hadoop API for performing I/O from file systems (local, 
HDFS, S3 etc). Therefore you can use the same Hadoop InputFormat and 
RecordReader with Spark that you use with Hadoop for your multi-line record 
format. See SparkContext APIs. Just like Hadoop, you will need to make sure 
that your files are split at record boundaries.
Hope this is helpful.



On Sun, Nov 23, 2014 at 8:35 AM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com.invalid wrote:

I am a newbie as well to Spark. Been Hadoop/Hive/Oozie programming extensively 
before this. I use Hadoop(Java MR code)/Hive/Impala/Presto on a daily basis.
To get me jumpstarted into Spark I started this gitHub where there is 
IntelliJ-ready-To-run code (simple examples of jon, sparksql etc) and I will 
keep adding to that. I dont know scala and I am learning that too to help me 
use Spark better.https://github.com/sanjaysubramanian/msfx_scala.git

Philosophically speaking its possibly not a good idea to take an either/or 
approach to technology...Like its never going to be either RDBMS or NOSQL (If 
the Cassandra behind FB shows 100 fewer likes instead of 1000 on you Photo a 
day for some reason u may not be as upset...but if the Oracle/Db2 systems 
behind Wells Fargo show $100 LESS in your account due to an database error, you 
will be PANIC-ing).

So its the same case with Spark or Hadoop. I can speak for myself. I have a 
usecase for processing old logs that are multiline (i.e. they have a 
[begin_timestamp_logid] and [end_timestamp_logid] and have many lines in  
between. In Java Hadoop I created custom RecordReaders to solve this. I still 
dont know how to do this in Spark. 

Re: SparkSQL Timestamp query failure

2014-11-23 Thread whitebread
Cheng thanks,

thanks to you I found out that the problem as you guessed was a precision one. 

2012-10-08 16:10:36 instead of 2012-10-08 16:10:36.0


Thanks again.

Alessandro


 On Nov 23, 2014, at 11:10 PM, Cheng, Hao [via Apache Spark User List] 
 ml-node+s1001560n19613...@n3.nabble.com wrote:
 
 Can you try query like “SELECT timestamp, CAST(timestamp as string) FROM logs 
 LIMIT 5”, I guess you probably ran into the timestamp precision or the 
 timezone shifting problem.
 
  
 
 (And it’s not mandatory, but you’d better change the field name from 
 “timestamp” to something else, as “timestamp” is the keyword of data type in 
 Hive/Spark SQL.)
 
   
 From: Alessandro Panebianco [mailto:ale.panebia...@me.com] 
 Sent: Monday, November 24, 2014 11:12 AM
 To: Wang, Daoyuan
 Cc: u...@spark.incubator.apache.org
 Subject: Re: SparkSQL Timestamp query failure
 
  
 
 Hey Daoyuan, 
 
  
 
 following your suggestion I obtain the same result as when I do:
 
  
 
 where l.timestamp = '2012-10-08 16:10:36.0’
 
  
 
 what happens using either your suggestion or simply using single quotes as I 
 just typed in the example before is that the query does not fail but it 
 doesn’t return anything either as it should.
 
  
 
 If I do a simple :
 
  
 
 SELECT timestamp FROM Logs limit 5).collect.foreach(println) 
 
  
 
 I get: 
 
  
 
 [2012-10-08 16:10:36.0]
 
 [2012-10-08 16:10:36.0]
 
 [2012-10-08 16:10:36.0]
 
 [2012-10-08 16:10:41.0]
 
 [2012-10-08 16:10:41.0]
 
  
 
 that is why I am sure that putting one of those timestamps should not return 
 an empty arrray.
 
  
 
 Id really love to find a solution to this problem. Since Spark supports 
 Timestamp it should provide simple comparison actions with them in my opinion.
 
  
 
 Any other help would be greatly appreciated.
 
  
 
 Alessandro
 
  
 
  
 
  
 
  
 
 On Nov 23, 2014, at 8:10 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:
 
  
 
 Hi,
 
 I think you can try
 cast(l.timestamp as string)='2012-10-08 16:10:36.0'
 
 Thanks,
 Daoyuan
 
 -Original Message-
 From: whitebread [mailto:ale.panebia...@me.com] 
 Sent: Sunday, November 23, 2014 12:11 AM
 To: u...@spark.incubator.apache.org
 Subject: Re: SparkSQL Timestamp query failure
 
 Thanks for your answer Akhil, 
 
 I have already tried that and the query actually doesn't fail but it doesn't 
 return anything either as it should.
 Using single quotes I think it reads it as a string and not as a timestamp. 
 
 I don't know how to solve this. Any other hint by any chance?
 
 Thanks,
 
 Alessandro
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19554.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
  
 
 
 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19616.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.