Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-13 Thread shyla deshpande
Hello,

My Spark Streaming app, which reads Kafka topics and prints the DStream, works
fine on my laptop, but on the AWS cluster it produces no output and no errors.

Please help me debug.

I am using Spark 2.0.2 and the kafka-0-10 connector.
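
For reference, here is a minimal sketch of this kind of app with the kafka-0-10 direct stream (the broker address, topic name, and offset-reset setting below are placeholders, not my actual configuration):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("KafkaPrint")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",        // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "whilDataStream",
  "auto.offset.reset" -> "earliest",           // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// subscribe to the topic and print each record's value
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mytopic"), kafkaParams)
)

stream.map(_.value).print()

ssc.start()
ssc.awaitTermination()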

Thanks

The following is the output of the Spark Streaming app:


17/01/14 06:22:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/14 06:22:43 WARN Checkpoint: Checkpoint directory check1 does not exist
Creating new context
17/01/14 06:22:45 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
17/01/14 06:22:45 WARN KafkaUtils: overriding enable.auto.commit to false for executor
17/01/14 06:22:45 WARN KafkaUtils: overriding auto.offset.reset to none for executor
17/01/14 06:22:45 WARN KafkaUtils: overriding executor group.id to spark-executor-whilDataStream
17/01/14 06:22:45 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135


Debugging a PythonException with no details

2017-01-13 Thread Nicholas Chammas
I’m looking for tips on how to debug a PythonException that’s very sparse
on details. The full exception is below, but the only interesting bits
appear to be the following lines:

org.apache.spark.api.python.PythonException:
...
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext

Otherwise, the only other clue I can see in the traceback is that the
problem may somehow involve a UDF.

I’ve tested this code against many datasets (stored as ORC) and it works
fine. The same code only seems to throw this error on a few datasets that
happen to be sourced via JDBC. I can’t seem to get a lead on what might be
going wrong here.

Does anyone have tips on how to debug a problem like this? How do I find
out more specifically what is going wrong?

Nick

Here’s the full exception:

17/01/13 17:12:14 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 15, devlx023.private.massmutual.com, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 97, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 78, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 431, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/hadoop/yarn/nm/usercache/jenkins/appcache/application_1483203887152_1207/container_1483203887152_1207_01_05/splinkr/person.py", line 111, in <module>
    py_normalize_udf = udf(py_normalize, StringType())
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1868, in udf
    return UserDefinedFunction(f, returnType)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1826, in __init__
    self._judf = self._create_judf(name)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1830, in _create_judf
    sc = SparkContext.getOrCreate()
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 307, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 179, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 246, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
at 

Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Tathagata Das
Structured Streaming has a foreach sink, where you can essentially do whatever
you want with your data. It's easy to create a Kafka producer and write the
data out to Kafka.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach
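
For example, here is a rough sketch of such a ForeachWriter (the broker address, topic name, and row-to-string conversion are assumptions, not part of the original answer):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Publishes each row of a streaming DataFrame to a Kafka topic.
class KafkaForeachWriter(brokers: String, topic: String) extends ForeachWriter[Row] {
  private var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  def process(value: Row): Unit =
    producer.send(new ProducerRecord[String, String](topic, value.mkString(",")))

  def close(errorCause: Throwable): Unit = producer.close()
}

// usage: streamingDF.writeStream.foreach(new KafkaForeachWriter("broker:9092", "events")).start()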

On Fri, Jan 13, 2017 at 8:28 AM, Koert Kuipers  wrote:

> how do you do this with structured streaming? i see no mention of writing
> to kafka
>
> On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian 
> wrote:
>
>> Yes, it is called Structured Streaming:
>> https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>>
>> On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar 
>> wrote:
>>
>>> Hi Team ,
>>>
>>>  Sorry if this question already asked in this forum..
>>>
>>> Can we ingest data to Apache Kafka Topic from Spark SQL DataFrame ??
>>>
>>> Here is my Code which Reads Parquet File :
>>>
>>> *val sqlContext = new org.apache.spark.sql.SQLContext(sc);*
>>>
>>> *val df = sqlContext.read.parquet("/temp/*.parquet")*
>>>
>>> *df.registerTempTable("beacons")*
>>>
>>>
>>> I want to directly ingest df DataFrame to Kafka ! Is there any way to
>>> achieve this ??
>>>
>>>
>>> Cheers,
>>>
>>> Senthil
>>>
>>
>>
>


filter rows based on all columns

2017-01-13 Thread Xiaomeng Wan
I need to filter out outliers from a DataFrame based on all columns. I can
manually list all the columns, like this:

df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3 * stddevs(0))
  .filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3 * stddevs(1))

...

But I want to turn it into a general function which can handle a variable
number of columns. How could I do that? Thanks in advance!
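
One hedged way to generalize this (a sketch, assuming means and stddevs are indexed in the same order as df.columns and that every column is numeric) is to fold the per-column filter over the column indices:

val filtered = df.columns.indices.foldLeft(df) { (acc, i) =>
  // keep rows whose value in column i is within 3 standard deviations of that column's mean
  acc.filter(x => math.abs(x.get(i).toString.toDouble - means(i)) <= 3 * stddevs(i))
}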


Regards,

Shawn


Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
In terms of the NullPointerException, I think it is a bug: the test data
directories might have been moved already, so it fails to load the test data
used to create the test tables. You may create a JIRA for this.

On Fri, Jan 13, 2017 at 11:44 AM, Xin Wu  wrote:

> If you are using spark-shell, you have instance "sc" as the SparkContext
> initialized already. If you are writing your own application, you need to
> create a SparkSession, which comes with the SparkContext. So you can
> reference it like sparkSession.sparkContext.
>
> In terms of creating a table from DataFrame, do you intend to create it
> via TestHive? or just want to create a Hive serde table for the DataFrame?
>
> On Fri, Jan 13, 2017 at 10:23 AM, Nicolas Tallineau <
> nicolas.tallin...@ubisoft.com> wrote:
>
>> But it forces you to create your own SparkContext, which I’d rather not
>> do.
>>
>>
>>
>> Also it doesn’t seem to allow me to directly create a table from a
>> DataFrame, as follow:
>>
>>
>>
>> TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table")
>>
>>
>>
>> *From:* Xin Wu [mailto:xwu0...@gmail.com]
>> *Sent:* 13 janvier 2017 12:43
>> *To:* Nicolas Tallineau 
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: [Spark SQL - Scala] TestHive not working in Spark 2
>>
>>
>>
>> I used the following:
>>
>>
>> val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc,
>> *false*)
>>
>> val hiveClient = testHive.sessionState.metadataHive
>> hiveClient.runSqlHive(“….”)
>>
>>
>>
>> On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau <
>> nicolas.tallin...@ubisoft.com> wrote:
>>
>> I get a nullPointerException as soon as I try to execute a
>> TestHive.sql(...) statement since migrating to Spark 2 because it's trying
>> to load non existing "test tables". I couldn't find a way to switch to
>> false the loadTestTables variable.
>>
>>
>>
>> Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.getHiveFile(TestHive.scala:190)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.org$apache$spark$sql$hive$test$TestHiv
>> eSparkSession$$quoteHiveFile(TestHive.scala:196)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.(TestHive.scala:234)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveSparkSession.(TestHive.scala:122)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHiveContext.(TestHive.scala:80)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHive$.(TestHive.scala:47)
>>
>> at org.apache.spark.sql.hive.test
>> .TestHive$.(TestHive.scala)
>>
>>
>>
>> I’m using Spark 2.1.0 in this case.
>>
>>
>>
>> Am I missing something or should I create a bug in Jira?
>>
>>
>>
>>
>>
>> --
>>
>> Xin Wu
>> (650)392-9799
>>
>
>
>
> --
> Xin Wu
> (650)392-9799
>



-- 
Xin Wu
(650)392-9799


Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
If you are using spark-shell, the SparkContext is already initialized for you
as the instance "sc". If you are writing your own application, you need to
create a SparkSession, which comes with a SparkContext; you can reference it
as sparkSession.sparkContext.

In terms of creating a table from a DataFrame, do you intend to create it via
TestHive, or do you just want to create a Hive serde table for the DataFrame?
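
For the second case, a minimal sketch that avoids TestHive entirely (assuming the spark-hive module is on the classpath; the sample data and table name below are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-table-example")
  .enableHiveSupport()    // requires spark-hive on the classpath
  .getOrCreate()

// the underlying SparkContext is available as spark.sparkContext
import spark.implicits._

// made-up data; saveAsTable registers the table in the metastore
Seq((1, "a"), (2, "b")).toDF("id", "value")
  .write
  .saveAsTable("a_table")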

On Fri, Jan 13, 2017 at 10:23 AM, Nicolas Tallineau <
nicolas.tallin...@ubisoft.com> wrote:

> But it forces you to create your own SparkContext, which I’d rather not do.
>
>
>
> Also it doesn’t seem to allow me to directly create a table from a
> DataFrame, as follow:
>
>
>
> TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table")
>
>
>
> *From:* Xin Wu [mailto:xwu0...@gmail.com]
> *Sent:* 13 janvier 2017 12:43
> *To:* Nicolas Tallineau 
> *Cc:* user@spark.apache.org
> *Subject:* Re: [Spark SQL - Scala] TestHive not working in Spark 2
>
>
>
> I used the following:
>
>
> val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc,
> *false*)
>
> val hiveClient = testHive.sessionState.metadataHive
> hiveClient.runSqlHive(“….”)
>
>
>
> On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau <
> nicolas.tallin...@ubisoft.com> wrote:
>
> I get a nullPointerException as soon as I try to execute a
> TestHive.sql(...) statement since migrating to Spark 2 because it's trying
> to load non existing "test tables". I couldn't find a way to switch to
> false the loadTestTables variable.
>
>
>
> Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.
> getHiveFile(TestHive.scala:190)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.org
> $apache$spark$sql$hive$test$TestHiveSparkSession$$
> quoteHiveFile(TestHive.scala:196)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:234)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:122)
>
> at org.apache.spark.sql.hive.test.TestHiveContext.(
> TestHive.scala:80)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala:47)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala)
>
>
>
> I’m using Spark 2.1.0 in this case.
>
>
>
> Am I missing something or should I create a bug in Jira?
>
>
>
>
>
> --
>
> Xin Wu
> (650)392-9799
>



-- 
Xin Wu
(650)392-9799


RE: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Nicolas Tallineau
But it forces you to create your own SparkContext, which I’d rather not do.

Also, it doesn't seem to allow me to directly create a table from a DataFrame,
as follows:

TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table")

From: Xin Wu [mailto:xwu0...@gmail.com]
Sent: 13 janvier 2017 12:43
To: Nicolas Tallineau 
Cc: user@spark.apache.org
Subject: Re: [Spark SQL - Scala] TestHive not working in Spark 2

I used the following:

val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc, false)
val hiveClient = testHive.sessionState.metadataHive
hiveClient.runSqlHive(“….”)

On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau 
> wrote:
I get a nullPointerException as soon as I try to execute a TestHive.sql(...) 
statement since migrating to Spark 2 because it's trying to load non existing 
"test tables". I couldn't find a way to switch to false the loadTestTables 
variable.

Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
at org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:190)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.org$apache$spark$sql$hive$test$TestHiveSparkSession$$quoteHiveFile(TestHive.scala:196)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:234)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:122)
at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:80)
at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:47)
at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala)

I’m using Spark 2.1.0 in this case.

Am I missing something or should I create a bug in Jira?



--
Xin Wu
(650)392-9799


Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
I used the following:

val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc, false)
val hiveClient = testHive.sessionState.metadataHive
hiveClient.runSqlHive(“….”)



On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau <
nicolas.tallin...@ubisoft.com> wrote:

> I get a nullPointerException as soon as I try to execute a
> TestHive.sql(...) statement since migrating to Spark 2 because it's trying
> to load non existing "test tables". I couldn't find a way to switch to
> false the loadTestTables variable.
>
>
>
> Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.
> getHiveFile(TestHive.scala:190)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.org
> $apache$spark$sql$hive$test$TestHiveSparkSession$$
> quoteHiveFile(TestHive.scala:196)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:234)
>
> at org.apache.spark.sql.hive.test.TestHiveSparkSession.<
> init>(TestHive.scala:122)
>
> at org.apache.spark.sql.hive.test.TestHiveContext.(
> TestHive.scala:80)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala:47)
>
> at org.apache.spark.sql.hive.test.TestHive$.(
> TestHive.scala)
>
>
>
> I’m using Spark 2.1.0 in this case.
>
>
>
> Am I missing something or should I create a bug in Jira?
>



-- 
Xin Wu
(650)392-9799


Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Koert Kuipers
How do you do this with Structured Streaming? I see no mention of writing
to Kafka.

On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian 
wrote:

> Yes, it is called Structured Streaming:
> https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>
> On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar 
> wrote:
>
>> Hi Team ,
>>
>>  Sorry if this question already asked in this forum..
>>
>> Can we ingest data to Apache Kafka Topic from Spark SQL DataFrame ??
>>
>> Here is my Code which Reads Parquet File :
>>
>> *val sqlContext = new org.apache.spark.sql.SQLContext(sc);*
>>
>> *val df = sqlContext.read.parquet("/temp/*.parquet")*
>>
>> *df.registerTempTable("beacons")*
>>
>>
>> I want to directly ingest df DataFrame to Kafka ! Is there any way to
>> achieve this ??
>>
>>
>> Cheers,
>>
>> Senthil
>>
>
>


Re: Schema evolution in tables

2017-01-13 Thread sim
There is no automated solution right now. You have to issue manual ALTER
TABLE commands, which works for adding top-level columns but gets tricky if
you are adding a field to a deeply nested struct.

Hopefully, the issue will be fixed in 2.2 because work has started on
https://issues.apache.org/jira/browse/SPARK-18727






Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Anahita Talebi
Hello,

Thanks a lot Dinko.
Yes, now it is working perfectly.


Cheers,
Anahita

On Fri, Jan 13, 2017 at 2:19 PM, Dinko Srkoč  wrote:

> On 13 January 2017 at 13:55, Anahita Talebi 
> wrote:
> > Hi,
> >
> > Thanks for your answer.
> >
> > I have chose "Spark" in the "job type". There is not any option where we
> can
> > choose the version. How I can choose different version?
>
> There's "Preemptible workers, bucket, network, version,
> initialization, & access options" link just above the "Create" and
> "Cancel" buttons on the "Create a cluster" page. When you click it,
> you'll find "Image version" field where you can enter the image
> version.
>
> Dataproc versions:
> * 1.1 would be Spark 2.0.2,
> * 1.0 includes Spark 1.6.2
>
> More about versions can be found here:
> https://cloud.google.com/dataproc/docs/concepts/dataproc-versions
>
> Cheers,
> Dinko
>
> >
> > Thanks,
> > Anahita
> >
> >
> > On Thu, Jan 12, 2017 at 6:39 PM, A Shaikh 
> wrote:
> >>
> >> You may have tested this code on Spark version on your local machine
> >> version of which may be different to whats in Google Cloud Storage.
> >> You need to select appropraite Spark version when you submit your job.
> >>
> >> On 12 January 2017 at 15:51, Anahita Talebi 
> >> wrote:
> >>>
> >>> Dear all,
> >>>
> >>> I am trying to run a .jar file as a job using submit job in google
> cloud
> >>> console.
> >>> https://cloud.google.com/dataproc/docs/guides/submit-job
> >>>
> >>> I actually ran the spark code on my local computer to generate a .jar
> >>> file. Then in the Argument folder, I give the value of the arguments
> that I
> >>> used in the spark code. One of the argument is training data set that
> I put
> >>> in the same bucket that I save my .jar file. In the bucket, I put only
> the
> >>> .jar file, training dataset and testing dataset.
> >>>
> >>> Main class or jar
> >>> gs://Anahita/test.jar
> >>>
> >>> Arguments
> >>>
> >>> --lambda=.001
> >>> --eta=1.0
> >>> --trainFile=gs://Anahita/small_train.dat
> >>> --testFile=gs://Anahita/small_test.dat
> >>>
> >>> The problem is that when I run the job I get the following error and
> >>> actually it cannot read  my training and testing data sets.
> >>>
> >>> Exception in thread "main" java.lang.NoSuchMethodError:
> >>> org.apache.spark.rdd.RDD.coalesce(IZLscala/math/
> Ordering;)Lorg/apache/spark/rdd/RDD;
> >>>
> >>> Can anyone help me how I can solve this problem?
> >>>
> >>> Thanks,
> >>>
> >>> Anahita
> >>>
> >>>
> >>
> >
>


Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Peyman Mohajerian
Yes, it is called Structured Streaming:
https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar 
wrote:

> Hi Team ,
>
>  Sorry if this question already asked in this forum..
>
> Can we ingest data to Apache Kafka Topic from Spark SQL DataFrame ??
>
> Here is my Code which Reads Parquet File :
>
> *val sqlContext = new org.apache.spark.sql.SQLContext(sc);*
>
> *val df = sqlContext.read.parquet("/temp/*.parquet")*
>
> *df.registerTempTable("beacons")*
>
>
> I want to directly ingest df DataFrame to Kafka ! Is there any way to
> achieve this ??
>
>
> Cheers,
>
> Senthil
>


[Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Nicolas Tallineau
I get a NullPointerException as soon as I try to execute a TestHive.sql(...)
statement since migrating to Spark 2, because it's trying to load non-existent
"test tables". I couldn't find a way to switch the loadTestTables variable
to false.

Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
at org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:190)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.org$apache$spark$sql$hive$test$TestHiveSparkSession$$quoteHiveFile(TestHive.scala:196)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:234)
at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:122)
at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:80)
at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:47)
at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala)

I'm using Spark 2.1.0 in this case.

Am I missing something or should I create a bug in Jira?


Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Dinko Srkoč
On 13 January 2017 at 13:55, Anahita Talebi  wrote:
> Hi,
>
> Thanks for your answer.
>
> I have chose "Spark" in the "job type". There is not any option where we can
> choose the version. How I can choose different version?

There's "Preemptible workers, bucket, network, version,
initialization, & access options" link just above the "Create" and
"Cancel" buttons on the "Create a cluster" page. When you click it,
you'll find "Image version" field where you can enter the image
version.

Dataproc versions:
* 1.1 would be Spark 2.0.2,
* 1.0 includes Spark 1.6.2

More about versions can be found here:
https://cloud.google.com/dataproc/docs/concepts/dataproc-versions

Cheers,
Dinko

>
> Thanks,
> Anahita
>
>
> On Thu, Jan 12, 2017 at 6:39 PM, A Shaikh  wrote:
>>
>> You may have tested this code on Spark version on your local machine
>> version of which may be different to whats in Google Cloud Storage.
>> You need to select appropraite Spark version when you submit your job.
>>
>> On 12 January 2017 at 15:51, Anahita Talebi 
>> wrote:
>>>
>>> Dear all,
>>>
>>> I am trying to run a .jar file as a job using submit job in google cloud
>>> console.
>>> https://cloud.google.com/dataproc/docs/guides/submit-job
>>>
>>> I actually ran the spark code on my local computer to generate a .jar
>>> file. Then in the Argument folder, I give the value of the arguments that I
>>> used in the spark code. One of the argument is training data set that I put
>>> in the same bucket that I save my .jar file. In the bucket, I put only the
>>> .jar file, training dataset and testing dataset.
>>>
>>> Main class or jar
>>> gs://Anahita/test.jar
>>>
>>> Arguments
>>>
>>> --lambda=.001
>>> --eta=1.0
>>> --trainFile=gs://Anahita/small_train.dat
>>> --testFile=gs://Anahita/small_test.dat
>>>
>>> The problem is that when I run the job I get the following error and
>>> actually it cannot read  my training and testing data sets.
>>>
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> org.apache.spark.rdd.RDD.coalesce(IZLscala/math/Ordering;)Lorg/apache/spark/rdd/RDD;
>>>
>>> Can anyone help me how I can solve this problem?
>>>
>>> Thanks,
>>>
>>> Anahita
>>>
>>>
>>
>




Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Anahita Talebi
Hi,

Thanks for your answer.

I chose "Spark" as the "job type". There isn't any option where we
can choose the version. How can I choose a different version?

Thanks,
Anahita


On Thu, Jan 12, 2017 at 6:39 PM, A Shaikh  wrote:

> You may have tested this code on Spark version on your local machine
> version of which may be different to whats in Google Cloud Storage.
> You need to select appropraite Spark version when you submit your job.
>
> On 12 January 2017 at 15:51, Anahita Talebi 
> wrote:
>
>> Dear all,
>>
>> I am trying to run a .jar file as a job using submit job in google cloud
>> console.
>> https://cloud.google.com/dataproc/docs/guides/submit-job
>>
>> I actually ran the spark code on my local computer to generate a .jar
>> file. Then in the Argument folder, I give the value of the arguments that I
>> used in the spark code. One of the argument is training data set that I put
>> in the same bucket that I save my .jar file. In the bucket, I put only the
>> .jar file, training dataset and testing dataset.
>>
>> Main class or jar
>> gs://Anahita/test.jar
>>
>> Arguments
>>
>> --lambda=.001
>> --eta=1.0
>> --trainFile=gs://Anahita/small_train.dat
>> --testFile=gs://Anahita/small_test.dat
>>
>> The problem is that when I run the job I get the following error and
>> actually it cannot read  my training and testing data sets.
>>
>> Exception in thread "main" java.lang.NoSuchMethodError: 
>> org.apache.spark.rdd.RDD.coalesce(IZLscala/math/Ordering;)Lorg/apache/spark/rdd/RDD;
>>
>> Can anyone help me how I can solve this problem?
>>
>> Thanks,
>>
>> Anahita
>>
>>
>>
>


Re: Spark in docker over EC2

2017-01-13 Thread Teng Qiu
Hi, you can take a look at this project. It is a distributed HA Spark
cluster for an AWS environment using Docker: we put the Spark EC2
instances behind an ELB and use this code snippet to get the instance
IPs:
https://github.com/zalando-incubator/spark-appliance/blob/master/utils.py#L49-L56

Dockerfile:
https://github.com/zalando-incubator/spark-appliance/blob/master/Dockerfile

2017-01-11 0:49 GMT+01:00 Darren Govoni :
> Anyone got a good guide for getting spark master to talk to remote workers
> inside dockers? I followed the tips found by searching but doesn't work
> still. Spark 1.6.2.
>
> I exposed all the ports and tried to set local IP inside container to the
> host IP but spark complains it can't bind ui ports.
>
> Thanks in advance!
>
>
>
> Sent from my Verizon, Samsung Galaxy smartphone




Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Senthil Kumar
Hi Team,

Sorry if this question has already been asked in this forum.

Can we ingest data into an Apache Kafka topic from a Spark SQL DataFrame?

Here is my code, which reads a Parquet file:

val sqlContext = new org.apache.spark.sql.SQLContext(sc);

val df = sqlContext.read.parquet("/temp/*.parquet")

df.registerTempTable("beacons")


I want to directly ingest the df DataFrame into Kafka! Is there any way to
achieve this?
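
One way to do this from a batch DataFrame (a sketch, assuming the kafka-clients library is on the classpath; the broker address and the topic name "beacons" are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// send every row of the DataFrame to Kafka as a JSON string
df.toJSON.foreachPartition { partition =>
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")   // assumption
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  partition.foreach(json => producer.send(new ProducerRecord[String, String]("beacons", json)))
  producer.close()
}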


Cheers,

Senthil