Re: Consume WebService in Spark

2016-05-02 Thread Jörn Franke
In Spark this is no different than in any other program. However, a web service 
returning JSON is probably not very suitable for large data volumes.
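
For a small payload, a minimal sketch of what this can look like (endpoint URL and
bucket path are placeholders; `sc`/`sqlContext` as in the shell, Spark 1.x APIs):

import scala.io.Source

// Fetch the JSON on the driver; only sensible for modest response sizes.
val payload = Source.fromURL("https://api.example.com/sprinklr/messages?last=15m").mkString

// Hand the string to Spark and let the built-in JSON reader infer the schema.
val df = sqlContext.read.json(sc.parallelize(Seq(payload)))

// Write the processed result back to S3 (hypothetical bucket/prefix).
df.write.parquet("s3a://my-bucket/sprinklr/2016-05-02/")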

> On 03 May 2016, at 04:45, KhajaAsmath Mohammed  
> wrote:
> 
> Hi,
> 
> I am working on a project to pull data from Sprinklr every 15 minutes and 
> process it in Spark. After processing, I need to save the results back to an 
> S3 bucket.
> 
> Is there a way to talk to the web service directly from Spark and parse the 
> JSON response?
> 
> Thanks,
> Asmath

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Performance benchmarking of Spark Vs other languages

2016-05-02 Thread Jörn Franke
Hallo,

Spark is a general framework for distributed in-memory processing. You can 
always write a highly specialized piece of code that is faster than Spark, but 
then it can do only one thing, and if you need something else you will have to 
rewrite everything from scratch. This is why Spark is beneficial.
In this context, your setup does not make much sense: you should have at least 
5 worker nodes to make a meaningful evaluation.
Follow the Spark tuning and recommendation guide.
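
For reference, a hedged sketch of the linked MLlib example with the knobs that matter
for such timings (the file location is a placeholder, `sc` as in the shell; the last
argument to ALS.train is the number of blocks the problem is split into, which is what
local[n] scaling mostly exercises):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("ml-20m/ratings.csv")        // hypothetical path to the 20M ratings
  .filter(!_.startsWith("userId"))                     // drop the CSV header line
  .map(_.split(',') match {
    case Array(user, movie, rating, _) => Rating(user.toInt, movie.toInt, rating.toDouble)
  })

val model = ALS.train(ratings, 10, 10, 0.01, 8)        // rank, iterations, lambda, blocks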

> On 03 May 2016, at 07:02, Abhijith Chandraprabhu  wrote:
> 
> Hello,
> 
> I am trying to find some performance figures for Spark vs. various other 
> languages for an ALS-based recommender system. I am using the 20 million 
> ratings MovieLens dataset. The test environment is one big 30-core machine 
> with 132 GB of memory. I am using the Scala version of the script provided here,
> http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html 
> 
> I am not an expert in Spark, and I assume that varying n when invoking 
> Spark with --master local[n] is supposed to provide ideal scaling. 
> 
> Initial observations didn't favour Spark, by small margins, but as I said, 
> since I am not a Spark expert, I would comment only after being assured that 
> this is the optimal way of running the ALS snippet. 
> 
> Could the experts please help me with the optimal way to get the best 
> timings out of Spark's ALS example in the mentioned environment. Thanks.
> 
> -- 
> Best regards,
> Abhijith


Performance benchmarking of Spark Vs other languages

2016-05-02 Thread Abhijith Chandraprabhu
Hello,

I am trying to find some performance figures for Spark vs. various other
languages for an ALS-based recommender system. I am using the 20 million
ratings MovieLens dataset. The test environment is one big 30-core machine
with 132 GB of memory. I am using the Scala version of the script provided
here,
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html


I am not an expert in Spark, and I assume that varying n when invoking
Spark with --master local[n] is supposed to provide ideal scaling.

Initial observations didn't favour Spark, by small margins, but as I
said, since I am not a Spark expert, I would comment only after being
assured that this is the optimal way of running the ALS snippet.

Could the experts please help me with the optimal way to get the best
timings out of Spark's ALS example in the mentioned environment. Thanks.

-- 
Best regards,
Abhijith


Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
print() isn't really the best way to benchmark things, since it calls
take(10) under the covers, but 380 records / second for a single
receiver doesn't sound right in any case.

Am I understanding correctly that you're trying to process a large
number of already-existing kafka messages, not keep up with an
incoming stream?  Can you give any details (e.g. hardware, number of
topicpartitions, etc)?

Really though, I'd try to start with spark 1.6 and direct streams, or
even just kafkacat, as a baseline.
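
A minimal sketch of the direct stream mentioned above (Spark 1.3+ API from the
spark-streaming-kafka artifact; broker list and topic name are placeholders, `sc`
as in the shell):

import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  ssc, kafkaParams, Set("my-topic"))

// Count each batch instead of print(), so take(10) is not what gets timed.
stream.foreachRDD(rdd => println("batch records: " + rdd.count()))

ssc.start()
ssc.awaitTermination()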



On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams  wrote:
> Hello again. I searched for "backport kafka" in the list archives but
> couldn't find anything but a post from Spark 0.7.2 . I was going to
> use accumulators to make a counter, but then saw on the Streaming tab
> the Receiver Statistics. Then I removed all other "functionality"
> except:
>
>
> JavaPairReceiverInputDStream<byte[], byte[]> dstream = KafkaUtils
>   // createStream(JavaStreamingContext jssc, Class<K> keyTypeClass, Class<V> valueTypeClass,
>   //   Class<U> keyDecoderClass, Class<T> valueDecoderClass, java.util.Map<String, String> kafkaParams,
>   //   java.util.Map<String, Integer> topics, StorageLevel storageLevel)
>   .createStream(jssc, byte[].class, byte[].class,
>       kafka.serializer.DefaultDecoder.class,
>       kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
>       StorageLevel.MEMORY_AND_DISK_SER());
>
>    dstream.print();
>
> Then in the Receiver Stats for the single receiver, I'm seeing around
> 380 records / second. Then to get anywhere near my 10% mentioned
> above, I'd need to run around 21 receivers, assuming 380 records /
> second, just using the print output. This seems awfully high to me,
> considering that I wrote 8+ records a second to Kafka from a
> mapreduce job, and that my bottleneck was likely Hbase. Again using
> the 380 estimate, I would need 200+ receivers to reach a similar
> amount of reads.
>
> Even given the issues with the 1.2 receivers, is this the expected way
> to use the Kafka streaming API, or am I doing something terribly
> wrong?
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
>> Have you tested for read throughput (without writing to hbase, just
>> deserialize)?
>>
>> Are you limited to using spark 1.2, or is upgrading possible?  The
>> kafka direct stream is available starting with 1.3.  If you're stuck
>> on 1.2, I believe there have been some attempts to backport it, search
>> the mailing list archives.
>>
>> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
>> wrote:
>>> I've written an application to get content from a kafka topic with 1.7
>>> billion entries,  get the protobuf serialized entries, and insert into
>>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>>
>>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>>> 0-2500 writes / second. This will take much too long to consume the
>>> entries.
>>>
>>> I currently believe that the spark kafka receiver is the bottleneck.
>>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>>> notice any large performance difference. I've tried many different
>>> spark configuration options, but can't seem to get better performance.
>>>
>>> I saw 8 requests / second inserting these records into kafka using
>>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>>
>>> While hbase inserts might not deliver the same throughput, I'd like to
>>> at least get 10%.
>>>
>>> My application looks like
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>> This is my first spark application. I'd appreciate any assistance.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi,

It's 4:40 in the morning here, so I might not be getting things right.

But there is a very high chance of getting spurious results if you have
created that variable more than once in the IPython or pyspark shell, cached
it, and are reusing it. Please close the sessions, create the variable only
once, then add the table (after ensuring that you have dropped it first), and
then check it.

I had faced this issue once and then realized it was because of the above
reasons.


Regards,
Gourav

On Mon, May 2, 2016 at 10:13 PM, Kevin Peng  wrote:

> Yong,
>
> Sorry, let me explain my deduction; it is going to be difficult to get sample
> data out since the dataset I am using is proprietary.
>
> From the above set of queries (the ones mentioned in the comments above), both
> the inner and the outer joins are producing the same counts.  They are basically
> pulling out selected columns from the query, but there is no roll-up happening
> or anything that would possibly make it suspicious that there is any difference
> besides the type of join.  The tables are matched on three keys that are present
> in both tables (ad, account, and date); on top of this they are filtered by date
> being on or after 2016-01-03.  Since all the joins are producing the same counts,
> the natural suspicion is that the tables are identical, but when I run the
> following two queries:
>
> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >='2016-01-03'").count
>
> res14: Long = 34158
>
> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >='2016-01-03'").count
>
> res15: Long = 42693
>
>
> The above two queries filter the data on the same date (2016-01-03) used by
> the joins, and you can see that the row counts of the two tables are different,
> which is why I suspect something is wrong with the outer joins in Spark SQL:
> in this situation the right and full outer joins may produce the same results,
> but they should not be equal to the left join and definitely not to the inner
> join; unless I am missing something.
>
>
> Side note: In my haste response above I posted the wrong counts for
> dps.count, the real value is res16: Long = 42694
>
>
> Thanks,
>
>
> KP
>
>
>
> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang  wrote:
>
>> We are still not sure what the problem is if you cannot show us some
>> example data.
>>
>> For dps with 42632 rows and swig with 42034 rows: dps full outer join
>> with swig on 3 columns, with additional filters, gets the same result-set row
>> count as dps left outer join with swig on 3 columns with additional filters,
>> which again gets the same result-set row count as dps right outer join with
>> swig on 3 columns with the same additional filters.
>>
>> Without knowing your data, I cannot see a reason that this has to be a bug
>> in Spark.
>>
>> Am I misunderstanding your bug?
>>
>> Yong
>>
>> --
>> From: kpe...@gmail.com
>> Date: Mon, 2 May 2016 12:11:18 -0700
>> Subject: Re: Weird results with Spark SQL Outer joins
>> To: gourav.sengu...@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> Gourav,
>>
>> I wish that was case, but I have done a select count on each of the two
>> tables individually and they return back different number of rows:
>>
>>
>> dps.registerTempTable("dps_pin_promo_lt")
>> swig.registerTempTable("swig_pin_promo_lt")
>>
>>
>> dps.count()
>> RESULT: 42632
>>
>>
>> swig.count()
>> RESULT: 42034
>>
>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>> This shows that both the tables have matching records and no mismatches.
>> Therefore obviously you have the same results irrespective of whether you
>> use right or left join.
>>
>> I think that there is no problem here, unless I am missing something.
>>
>> Regards,
>> Gourav
>>
>> On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
>>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>>
>
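
A minimal sketch with made-up rows showing one way the counts above can coincide:
a WHERE clause that references columns from both sides, applied after an outer join,
drops the null-extended rows, so left, right and full outer joins can all return the
inner-join count even when the two tables have different row counts (this may or may
not be what happens with the proprietary data):

import sqlContext.implicits._

val swig = Seq(("2016-01-03", "a1", "ad1", 1.0)).toDF("date", "account", "ad", "spend")
val dps  = Seq(("2016-01-03", "a1", "ad1", 2.0),
               ("2016-01-04", "a2", "ad9", 3.0)).toDF("date", "account", "ad", "spend_in_dollar")
swig.registerTempTable("s")
dps.registerTempTable("d")

// All four variants print 1 here: the unmatched dps row comes back from the outer joins
// with s.* = NULL, and the predicate on s.date then filters it out.
for (j <- Seq("INNER", "LEFT OUTER", "RIGHT OUTER", "FULL OUTER"))
  println(j + ": " + sqlContext.sql(
    "SELECT * FROM s " + j + " JOIN d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) " +
    "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count())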


Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread sunday2000
I use this command:


build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr 
-Dhadoop.version=2.7.0 package -DskipTests -X


and get this failure message:
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 14.567 s
[INFO] Finished at: 2016-05-03T11:40:32+08:00
[INFO] Final Memory: 42M/192M
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more
Caused by: javac returned nonzero exit code
at sbt.compiler.JavaCompiler$JavaTool0.compile(JavaCompiler.scala:77)
at sbt.compiler.JavaTool$class.apply(JavaCompiler.scala:35)
at sbt.compiler.JavaCompiler$JavaTool0.apply(JavaCompiler.scala:63)
at sbt.compiler.JavaCompiler$class.compile(JavaCompiler.scala:21)
at sbt.compiler.JavaCompiler$JavaTool0.compile(JavaCompiler.scala:63)
at 
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileJava$1$1.apply$mcV$sp(AggressiveCompile.scala:127)
at 
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileJava$1$1.apply(AggressiveCompile.scala:127)
at 
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileJava$1$1.apply(AggressiveCompile.scala:127)
at 
sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:166)
at 
sbt.compiler.AggressiveCompile$$anonfun$3.compileJava$1(AggressiveCompile.scala:126)
at 
sbt.compiler.AggressiveCompile$$anonfun$3.apply(AggressiveCompile.scala:143)
at 
sbt.compiler.AggressiveCompile$$anonfun$3.apply(AggressiveCompile.scala:87)
at 
sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:39)
at 
sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:37)
at sbt.inc.IncrementalCommon.cycle(Incremental.scala:99)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:38)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:37)
at sbt.inc.Incremental$.manageClassfiles(Incremental.scala:65)
at sbt.inc.Incremental$.compile(Incremental.scala:37)
at 

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread Ted Yu
I rebuilt 1.6.1 locally:

[INFO] Spark Project External Kafka ... SUCCESS [
30.868 s]
[INFO] Spark Project Examples . SUCCESS [02:29
min]
[INFO] Spark Project External Kafka Assembly .. SUCCESS [
 9.644 s]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 23:51 min
[INFO] Finished at: 2016-05-02T20:17:36-07:00
[INFO] Final Memory: 116M/4074M

Here was the command I used:

build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.0 package -DskipTests

I used Java 1.8.0_65

Can you specify -X on mvn command line ?

On Mon, May 2, 2016 at 7:51 PM, sunday2000 <2314476...@qq.com> wrote:

> Hi,
>
>   This is not a continuation of a previous query, and I am now building with
> a direct internet connection, without a proxy, as before.
>
>   After disabling Zinc, I get this error message:
>
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-test-tags_2.10: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-test-tags_2.10: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> java version:
> java -version
> java version "1.8.0_91"
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
>
> maven version:
> spark-1.6.1/build/apache-maven-3.3.3/bin/mvn
>
>
>
> -- Original Message --
> *From:* "Ted Yu";;
> *Sent:* Tuesday, 3 May 2016, 10:43 AM
> *To:* "sunday2000" <2314476...@qq.com>;
> *Cc:* "user";
> *Subject:* Re: spark 1.6.1 build failure of : scala-maven-plugin
>
> Looks like this was continuation of your previous query.
>
> If that is the case, please use original thread so that people can have
> more context.
>
> Have you tried disabling Zinc server ?
>
> What's the version of Java / maven you're using ?
>
> Are you behind proxy ?
>
> Finally the 1.6.1 artifacts are available in many places. Is there any
> modification you have done which requires rebuilding ?
>
> Cheers
>
> On Mon, May 2, 2016 at 7:18 PM, sunday2000 <2314476...@qq.com> wrote:
>
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 14.765 s
>> [INFO] Finished at: 2016-05-03T10:08:46+08:00
>> [INFO] Final Memory: 35M/191M
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
>> on project spark-test-tags_2.10: Execution scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
>> -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
>> (scala-compile-first) on project spark-test-tags_2.10: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
>> at
>> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
>> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
>> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
>> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
>>  

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread sunday2000
Hi,

  This is not a continuation of a previous query, and I am now building with a 
direct internet connection, without a proxy, as before.

  After disabling Zinc, I get this error message:

[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)

java version:
java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)


maven version:
spark-1.6.1/build/apache-maven-3.3.3/bin/mvn






------ Original Message ------
From: "Ted Yu";;
Sent: Tuesday, 3 May 2016, 10:43 AM
To: "sunday2000" <2314476...@qq.com>; 
Cc: "user"; 
Subject: Re: spark 1.6.1 build failure of : scala-maven-plugin



Looks like this was continuation of your previous query.

If that is the case, please use original thread so that people can have more 
context.


Have you tried disabling Zinc server ?


What's the version of Java / maven you're using ?


Are you behind proxy ?


Finally the 1.6.1 artifacts are available in many places. Is there any 
modification you have done which requires rebuilding ?


Cheers


On Mon, May 2, 2016 at 7:18 PM, sunday2000 <2314476...@qq.com> wrote:
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 14.765 s
[INFO] Finished at: 2016-05-03T10:08:46+08:00
[INFO] Final Memory: 35M/191M
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 

Consume WebService in Spark

2016-05-02 Thread KhajaAsmath Mohammed
Hi,

I am working on a project to pull data from Sprinklr every 15 minutes
and process it in Spark. After processing, I need to save the results back to
an S3 bucket.

Is there a way to talk to the web service directly from Spark and parse
the JSON response?

Thanks,
Asmath


Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread Ted Yu
Looks like this was continuation of your previous query.

If that is the case, please use original thread so that people can have
more context.

Have you tried disabling Zinc server ?

What's the version of Java / maven you're using ?

Are you behind proxy ?

Finally the 1.6.1 artifacts are available in many places. Is there any
modification you have done which requires rebuilding ?

Cheers

On Mon, May 2, 2016 at 7:18 PM, sunday2000 <2314476...@qq.com> wrote:

> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 14.765 s
> [INFO] Finished at: 2016-05-03T10:08:46+08:00
> [INFO] Final Memory: 35M/191M
> [INFO]
> 
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-test-tags_2.10: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-test-tags_2.10: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at
> sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at
> scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at
> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> ... 21 more
> [ERROR]
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-test-tags_2.10


spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread sunday2000
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 14.765 s
[INFO] Finished at: 2016-05-03T10:08:46+08:00
[INFO] Final Memory: 35M/191M
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more
Caused by: Compile failed via zinc server
at 
sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at 
sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at 
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at 
scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at 
scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
... 21 more
[ERROR] 
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-test-tags_2.10

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Luciano Resende
You might have a settings.xml that is forcing your internal Maven
repository to be the mirror of external repositories and thus not finding
the dependency.

On Mon, May 2, 2016 at 6:11 PM, Hien Luu  wrote:

> No, I am not.  I am considering downloading it manually and placing it in my
> local repository.
>
> On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) <
> ravishankar.n...@gmail.com> wrote:
>
>> Oracle JDBC is not part of the Maven repository; are you keeping a
>> downloaded file in your local repo?
>>
>> Best, RS
>> On May 2, 2016 8:51 PM, "Hien Luu"  wrote:
>>
>>> Hi all,
>>>
>>> I am running into a build problem with
>>> com.oracle:ojdbc6:jar:11.2.0.1.0.  It kept getting "Operation timed out"
>>> while building Spark Project Docker Integration Tests module (see the error
>>> below).
>>>
>>> Has anyone run into this problem before? If so, how did you work around
>>> this problem?
>>>
>>> [INFO] Reactor Summary:
>>>
>>> [INFO]
>>>
>>> [INFO] Spark Project Parent POM ... SUCCESS [
>>> 2.423 s]
>>>
>>> [INFO] Spark Project Test Tags  SUCCESS [
>>> 0.712 s]
>>>
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 0.498 s]
>>>
>>> [INFO] Spark Project Networking ... SUCCESS [
>>> 1.743 s]
>>>
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>> 0.587 s]
>>>
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>> 0.503 s]
>>>
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 4.894 s]
>>>
>>> [INFO] Spark Project Core . SUCCESS [
>>> 17.953 s]
>>>
>>> [INFO] Spark Project GraphX ... SUCCESS [
>>> 3.480 s]
>>>
>>> [INFO] Spark Project Streaming  SUCCESS [
>>> 6.022 s]
>>>
>>> [INFO] Spark Project Catalyst . SUCCESS [
>>> 8.664 s]
>>>
>>> [INFO] Spark Project SQL .. SUCCESS [
>>> 12.440 s]
>>>
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 0.498 s]
>>>
>>> [INFO] Spark Project ML Library ... SUCCESS [
>>> 8.594 s]
>>>
>>> [INFO] Spark Project Tools  SUCCESS [
>>> 0.162 s]
>>>
>>> [INFO] Spark Project Hive . SUCCESS [
>>> 9.834 s]
>>>
>>> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
>>> 1.428 s]
>>>
>>> [INFO] Spark Project Docker Integration Tests . FAILURE
>>> [02:32 min]
>>>
>>> [INFO] Spark Project REPL . SKIPPED
>>>
>>> [INFO] Spark Project Assembly . SKIPPED
>>>
>>> [INFO] Spark Project External Flume Sink .. SKIPPED
>>>
>>> [INFO] Spark Project External Flume ... SKIPPED
>>>
>>> [INFO] Spark Project External Flume Assembly .. SKIPPED
>>>
>>> [INFO] Spark Project External Kafka ... SKIPPED
>>>
>>> [INFO] Spark Project Examples . SKIPPED
>>>
>>> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>>>
>>> [INFO] Spark Project Java 8 Tests . SKIPPED
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] BUILD FAILURE
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] Total time: 03:53 min
>>>
>>> [INFO] Finished at: 2016-05-02T17:44:57-07:00
>>>
>>> [INFO] Final Memory: 80M/1525M
>>>
>>> [INFO]
>>> 
>>>
>>> [ERROR] Failed to execute goal on project
>>> spark-docker-integration-tests_2.11: Could not resolve dependencies for
>>> project
>>> org.apache.spark:spark-docker-integration-tests_2.11:jar:2.0.0-SNAPSHOT:
>>> Failed to collect dependencies at com.oracle:ojdbc6:jar:11.2.0.1.0: Failed
>>> to read artifact descriptor for com.oracle:ojdbc6:jar:11.2.0.1.0: Could not
>>> transfer artifact com.oracle:ojdbc6:pom:11.2.0.1.0 from/to uber-artifactory
>>> (http://artifactory.uber.internal:4587/artifactory/repo/): Connect to
>>> artifactory.uber.internal:4587 [artifactory.uber.internal/10.162.11.61]
>>> failed: Operation timed out -> [Help 1]
>>>
>>>
>
>
> --
> Regards,
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Hien Luu
No, I am not.  I am considering downloading it manually and placing it in my
local repository.

On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:

> Oracle JDBC is not part of the Maven repository; are you keeping a downloaded
> file in your local repo?
>
> Best, RS
> On May 2, 2016 8:51 PM, "Hien Luu"  wrote:
>
>> Hi all,
>>
>> I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.
>> It kept getting "Operation timed out" while building Spark Project Docker
>> Integration Tests module (see the error below).
>>
>> Has anyone run into this problem before? If so, how did you work around
>> this problem?
>>
>> [INFO] Reactor Summary:
>>
>> [INFO]
>>
>> [INFO] Spark Project Parent POM ... SUCCESS [
>> 2.423 s]
>>
>> [INFO] Spark Project Test Tags  SUCCESS [
>> 0.712 s]
>>
>> [INFO] Spark Project Sketch ... SUCCESS [
>> 0.498 s]
>>
>> [INFO] Spark Project Networking ... SUCCESS [
>> 1.743 s]
>>
>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>> 0.587 s]
>>
>> [INFO] Spark Project Unsafe ... SUCCESS [
>> 0.503 s]
>>
>> [INFO] Spark Project Launcher . SUCCESS [
>> 4.894 s]
>>
>> [INFO] Spark Project Core . SUCCESS [
>> 17.953 s]
>>
>> [INFO] Spark Project GraphX ... SUCCESS [
>> 3.480 s]
>>
>> [INFO] Spark Project Streaming  SUCCESS [
>> 6.022 s]
>>
>> [INFO] Spark Project Catalyst . SUCCESS [
>> 8.664 s]
>>
>> [INFO] Spark Project SQL .. SUCCESS [
>> 12.440 s]
>>
>> [INFO] Spark Project ML Local Library . SUCCESS [
>> 0.498 s]
>>
>> [INFO] Spark Project ML Library ... SUCCESS [
>> 8.594 s]
>>
>> [INFO] Spark Project Tools  SUCCESS [
>> 0.162 s]
>>
>> [INFO] Spark Project Hive . SUCCESS [
>> 9.834 s]
>>
>> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
>> 1.428 s]
>>
>> [INFO] Spark Project Docker Integration Tests . FAILURE
>> [02:32 min]
>>
>> [INFO] Spark Project REPL . SKIPPED
>>
>> [INFO] Spark Project Assembly . SKIPPED
>>
>> [INFO] Spark Project External Flume Sink .. SKIPPED
>>
>> [INFO] Spark Project External Flume ... SKIPPED
>>
>> [INFO] Spark Project External Flume Assembly .. SKIPPED
>>
>> [INFO] Spark Project External Kafka ... SKIPPED
>>
>> [INFO] Spark Project Examples . SKIPPED
>>
>> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>>
>> [INFO] Spark Project Java 8 Tests . SKIPPED
>>
>> [INFO]
>> 
>>
>> [INFO] BUILD FAILURE
>>
>> [INFO]
>> 
>>
>> [INFO] Total time: 03:53 min
>>
>> [INFO] Finished at: 2016-05-02T17:44:57-07:00
>>
>> [INFO] Final Memory: 80M/1525M
>>
>> [INFO]
>> 
>>
>> [ERROR] Failed to execute goal on project
>> spark-docker-integration-tests_2.11: Could not resolve dependencies for
>> project
>> org.apache.spark:spark-docker-integration-tests_2.11:jar:2.0.0-SNAPSHOT:
>> Failed to collect dependencies at com.oracle:ojdbc6:jar:11.2.0.1.0: Failed
>> to read artifact descriptor for com.oracle:ojdbc6:jar:11.2.0.1.0: Could not
>> transfer artifact com.oracle:ojdbc6:pom:11.2.0.1.0 from/to uber-artifactory
>> (http://artifactory.uber.internal:4587/artifactory/repo/): Connect to
>> artifactory.uber.internal:4587 [artifactory.uber.internal/10.162.11.61]
>> failed: Operation timed out -> [Help 1]
>>
>>


-- 
Regards,


Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Ted Yu
From the output of dependency:tree of master branch:

[INFO]

[INFO] Building Spark Project Docker Integration Tests 2.0.0-SNAPSHOT
[INFO]

[WARNING] The POM for com.oracle:ojdbc6:jar:11.2.0.1.0 is missing, no
dependency information available
[INFO]
...
[INFO] +- com.oracle:ojdbc6:jar:11.2.0.1.0:test

Are you building behind a proxy ?

On Mon, May 2, 2016 at 5:51 PM, Hien Luu  wrote:

> Hi all,
>
> I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.
> It kept getting "Operation timed out" while building Spark Project Docker
> Integration Tests module (see the error below).
>
> Has anyone run into this problem before? If so, how did you work around this
> problem?
>
> [INFO] Reactor Summary:
>
> [INFO]
>
> [INFO] Spark Project Parent POM ... SUCCESS [
> 2.423 s]
>
> [INFO] Spark Project Test Tags  SUCCESS [
> 0.712 s]
>
> [INFO] Spark Project Sketch ... SUCCESS [
> 0.498 s]
>
> [INFO] Spark Project Networking ... SUCCESS [
> 1.743 s]
>
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 0.587 s]
>
> [INFO] Spark Project Unsafe ... SUCCESS [
> 0.503 s]
>
> [INFO] Spark Project Launcher . SUCCESS [
> 4.894 s]
>
> [INFO] Spark Project Core . SUCCESS [
> 17.953 s]
>
> [INFO] Spark Project GraphX ... SUCCESS [
> 3.480 s]
>
> [INFO] Spark Project Streaming  SUCCESS [
> 6.022 s]
>
> [INFO] Spark Project Catalyst . SUCCESS [
> 8.664 s]
>
> [INFO] Spark Project SQL .. SUCCESS [
> 12.440 s]
>
> [INFO] Spark Project ML Local Library . SUCCESS [
> 0.498 s]
>
> [INFO] Spark Project ML Library ... SUCCESS [
> 8.594 s]
>
> [INFO] Spark Project Tools  SUCCESS [
> 0.162 s]
>
> [INFO] Spark Project Hive . SUCCESS [
> 9.834 s]
>
> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
> 1.428 s]
>
> [INFO] Spark Project Docker Integration Tests . FAILURE [02:32
> min]
>
> [INFO] Spark Project REPL . SKIPPED
>
> [INFO] Spark Project Assembly . SKIPPED
>
> [INFO] Spark Project External Flume Sink .. SKIPPED
>
> [INFO] Spark Project External Flume ... SKIPPED
>
> [INFO] Spark Project External Flume Assembly .. SKIPPED
>
> [INFO] Spark Project External Kafka ... SKIPPED
>
> [INFO] Spark Project Examples . SKIPPED
>
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>
> [INFO] Spark Project Java 8 Tests . SKIPPED
>
> [INFO]
> 
>
> [INFO] BUILD FAILURE
>
> [INFO]
> 
>
> [INFO] Total time: 03:53 min
>
> [INFO] Finished at: 2016-05-02T17:44:57-07:00
>
> [INFO] Final Memory: 80M/1525M
>
> [INFO]
> 
>
> [ERROR] Failed to execute goal on project
> spark-docker-integration-tests_2.11: Could not resolve dependencies for
> project
> org.apache.spark:spark-docker-integration-tests_2.11:jar:2.0.0-SNAPSHOT:
> Failed to collect dependencies at com.oracle:ojdbc6:jar:11.2.0.1.0: Failed
> to read artifact descriptor for com.oracle:ojdbc6:jar:11.2.0.1.0: Could not
> transfer artifact com.oracle:ojdbc6:pom:11.2.0.1.0 from/to uber-artifactory
> (http://artifactory.uber.internal:4587/artifactory/repo/): Connect to
> artifactory.uber.internal:4587 [artifactory.uber.internal/10.162.11.61]
> failed: Operation timed out -> [Help 1]
>
>


Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread रविशंकर नायर
Oracle JDBC is not part of the Maven repository; are you keeping a downloaded
file in your local repo?

Best, RS
On May 2, 2016 8:51 PM, "Hien Luu"  wrote:

> Hi all,
>
> I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.
> It kept getting "Operation timed out" while building Spark Project Docker
> Integration Tests module (see the error below).
>
> Has anyone run into this problem before? If so, how did you work around this
> problem?
>
> [INFO] Reactor Summary:
>
> [INFO]
>
> [INFO] Spark Project Parent POM ... SUCCESS [
> 2.423 s]
>
> [INFO] Spark Project Test Tags  SUCCESS [
> 0.712 s]
>
> [INFO] Spark Project Sketch ... SUCCESS [
> 0.498 s]
>
> [INFO] Spark Project Networking ... SUCCESS [
> 1.743 s]
>
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 0.587 s]
>
> [INFO] Spark Project Unsafe ... SUCCESS [
> 0.503 s]
>
> [INFO] Spark Project Launcher . SUCCESS [
> 4.894 s]
>
> [INFO] Spark Project Core . SUCCESS [
> 17.953 s]
>
> [INFO] Spark Project GraphX ... SUCCESS [
> 3.480 s]
>
> [INFO] Spark Project Streaming  SUCCESS [
> 6.022 s]
>
> [INFO] Spark Project Catalyst . SUCCESS [
> 8.664 s]
>
> [INFO] Spark Project SQL .. SUCCESS [
> 12.440 s]
>
> [INFO] Spark Project ML Local Library . SUCCESS [
> 0.498 s]
>
> [INFO] Spark Project ML Library ... SUCCESS [
> 8.594 s]
>
> [INFO] Spark Project Tools  SUCCESS [
> 0.162 s]
>
> [INFO] Spark Project Hive . SUCCESS [
> 9.834 s]
>
> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
> 1.428 s]
>
> [INFO] Spark Project Docker Integration Tests . FAILURE [02:32
> min]
>
> [INFO] Spark Project REPL . SKIPPED
>
> [INFO] Spark Project Assembly . SKIPPED
>
> [INFO] Spark Project External Flume Sink .. SKIPPED
>
> [INFO] Spark Project External Flume ... SKIPPED
>
> [INFO] Spark Project External Flume Assembly .. SKIPPED
>
> [INFO] Spark Project External Kafka ... SKIPPED
>
> [INFO] Spark Project Examples . SKIPPED
>
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>
> [INFO] Spark Project Java 8 Tests . SKIPPED
>
> [INFO]
> 
>
> [INFO] BUILD FAILURE
>
> [INFO]
> 
>
> [INFO] Total time: 03:53 min
>
> [INFO] Finished at: 2016-05-02T17:44:57-07:00
>
> [INFO] Final Memory: 80M/1525M
>
> [INFO]
> 
>
> [ERROR] Failed to execute goal on project
> spark-docker-integration-tests_2.11: Could not resolve dependencies for
> project
> org.apache.spark:spark-docker-integration-tests_2.11:jar:2.0.0-SNAPSHOT:
> Failed to collect dependencies at com.oracle:ojdbc6:jar:11.2.0.1.0: Failed
> to read artifact descriptor for com.oracle:ojdbc6:jar:11.2.0.1.0: Could not
> transfer artifact com.oracle:ojdbc6:pom:11.2.0.1.0 from/to uber-artifactory
> (http://artifactory.uber.internal:4587/artifactory/repo/): Connect to
> artifactory.uber.internal:4587 [artifactory.uber.internal/10.162.11.61]
> failed: Operation timed out -> [Help 1]
>
>


Error from reading S3 in Scala

2016-05-02 Thread Zhang, Jingyu
Hi All,

I am using Eclipse with Maven for developing Spark applications. I got an
error when reading from S3 in Scala, but it works fine in Java when I run
them in the same project in Eclipse. The Scala/Java code and the errors are
below.


Scala

val uri = URI.create("s3a://" + key + ":" + seckey + "@" +
  "graphclustering/config.properties")
val pt = new Path("s3a://" + key + ":" + seckey + "@" +
  "graphclustering/config.properties")
val fs = FileSystem.get(uri, ctx.hadoopConfiguration)
val inputStream: InputStream = fs.open(pt)
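
A hedged alternative sketch: pass the credentials through the Hadoop configuration
(the fs.s3a.access.key / fs.s3a.secret.key properties of the s3a connector) instead of
embedding them in the URI, which avoids any escaping problems with keys that contain
special characters; `key`, `seckey` and `ctx` as in the snippet above:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = ctx.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", key)
hadoopConf.set("fs.s3a.secret.key", seckey)

val fs = FileSystem.get(URI.create("s3a://graphclustering"), hadoopConf)
val in = fs.open(new Path("s3a://graphclustering/config.properties"))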

Exception: on aws-java-1.7.4 and hadoop-aws-2.6.1

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception:
Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden;
Request ID: 8A56DC7BF0BFF09A), S3 Extended Request ID

at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(
AmazonHttpClient.java:1160)

at com.amazonaws.http.AmazonHttpClient.executeOneRequest(
AmazonHttpClient.java:748)

at com.amazonaws.http.AmazonHttpClient.executeHelper(
AmazonHttpClient.java:467)

at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)

at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)

at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
AmazonS3Client.java:1050)

at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
AmazonS3Client.java:1027)

at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(
S3AFileSystem.java:688)

at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:222)

at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)

at com.news.report.graph.GraphCluster$.main(GraphCluster.scala:53)

at com.news.report.graph.GraphCluster.main(GraphCluster.scala)

16/05/03 10:49:17 INFO SparkContext: Invoking stop() from shutdown hook

16/05/03 10:49:17 INFO SparkUI: Stopped Spark web UI at
http://10.65.80.125:4040

16/05/03 10:49:17 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!

16/05/03 10:49:17 INFO MemoryStore: MemoryStore cleared

16/05/03 10:49:17 INFO BlockManager: BlockManager stopped

Exception: on aws-java-1.7.4 and hadoop-aws-2.7.2

16/05/03 10:23:40 INFO Slf4jLogger: Slf4jLogger started

16/05/03 10:23:40 INFO Remoting: Starting remoting

16/05/03 10:23:40 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriverActorSystem@10.65.80.125:61860]

16/05/03 10:23:40 INFO Utils: Successfully started service
'sparkDriverActorSystem' on port 61860.

16/05/03 10:23:40 INFO SparkEnv: Registering MapOutputTracker

16/05/03 10:23:40 INFO SparkEnv: Registering BlockManagerMaster

16/05/03 10:23:40 INFO DiskBlockManager: Created local directory at
/private/var/folders/sc/tdmkbvr1705b8p70xqj1kqks5l9p

16/05/03 10:23:40 INFO MemoryStore: MemoryStore started with capacity
1140.4 MB

16/05/03 10:23:40 INFO SparkEnv: Registering OutputCommitCoordinator

16/05/03 10:23:40 INFO Utils: Successfully started service 'SparkUI' on
port 4040.

16/05/03 10:23:40 INFO SparkUI: Started SparkUI at http://10.65.80.125:4040

16/05/03 10:23:40 INFO Executor: Starting executor ID driver on host
localhost

16/05/03 10:23:40 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 61861.

16/05/03 10:23:40 INFO NettyBlockTransferService: Server created on 61861

16/05/03 10:23:40 INFO BlockManagerMaster: Trying to register BlockManager

16/05/03 10:23:40 INFO BlockManagerMasterEndpoint: Registering block
manager localhost:61861 with 1140.4 MB RAM, BlockManagerId(driver,
localhost, 61861)

16/05/03 10:23:40 INFO BlockManagerMaster: Registered BlockManager

Exception in thread "main" java.lang.NoSuchMethodError:
com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V

at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:285)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

at com.news.report.graph.GraphCluster$.main(GraphCluster.scala:52)

at com.news.report.graph.GraphCluster.main(GraphCluster.scala)

16/05/03 10:23:51 INFO SparkContext: Invoking stop() from shutdown hook

16/05/03 10:23:51 INFO SparkUI: Stopped Spark web UI at
http://10.65.80.125:4040

16/05/03 10:23:51 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!

16/05/03 10:23:51 INFO MemoryStore: MemoryStore cleared

16/05/03 10:23:51 INFO BlockManager: BlockManager stopped

16/05/03 10:23:51 INFO BlockManagerMaster: BlockManagerMaster stopped

16/05/03 10:23:51 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!

16/05/03 10:23:51 INFO SparkContext: Successfully stopped SparkContext


Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Hien Luu
Hi all,

I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0.
It keeps getting "Operation timed out" while building the Spark Project Docker
Integration Tests module (see the error below).

Has anyone run into this problem before? If so, how did you work around this
problem?

[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [
2.423 s]

[INFO] Spark Project Test Tags  SUCCESS [
0.712 s]

[INFO] Spark Project Sketch ... SUCCESS [
0.498 s]

[INFO] Spark Project Networking ... SUCCESS [
1.743 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
0.587 s]

[INFO] Spark Project Unsafe ... SUCCESS [
0.503 s]

[INFO] Spark Project Launcher . SUCCESS [
4.894 s]

[INFO] Spark Project Core . SUCCESS [
17.953 s]

[INFO] Spark Project GraphX ... SUCCESS [
3.480 s]

[INFO] Spark Project Streaming  SUCCESS [
6.022 s]

[INFO] Spark Project Catalyst . SUCCESS [
8.664 s]

[INFO] Spark Project SQL .. SUCCESS [
12.440 s]

[INFO] Spark Project ML Local Library . SUCCESS [
0.498 s]

[INFO] Spark Project ML Library ... SUCCESS [
8.594 s]

[INFO] Spark Project Tools  SUCCESS [
0.162 s]

[INFO] Spark Project Hive . SUCCESS [
9.834 s]

[INFO] Spark Project HiveContext Compatibility  SUCCESS [
1.428 s]

[INFO] Spark Project Docker Integration Tests . FAILURE [02:32
min]

[INFO] Spark Project REPL . SKIPPED

[INFO] Spark Project Assembly . SKIPPED

[INFO] Spark Project External Flume Sink .. SKIPPED

[INFO] Spark Project External Flume ... SKIPPED

[INFO] Spark Project External Flume Assembly .. SKIPPED

[INFO] Spark Project External Kafka ... SKIPPED

[INFO] Spark Project Examples . SKIPPED

[INFO] Spark Project External Kafka Assembly .. SKIPPED

[INFO] Spark Project Java 8 Tests . SKIPPED

[INFO]


[INFO] BUILD FAILURE

[INFO]


[INFO] Total time: 03:53 min

[INFO] Finished at: 2016-05-02T17:44:57-07:00

[INFO] Final Memory: 80M/1525M

[INFO]


[ERROR] Failed to execute goal on project
spark-docker-integration-tests_2.11: Could not resolve dependencies for
project
org.apache.spark:spark-docker-integration-tests_2.11:jar:2.0.0-SNAPSHOT:
Failed to collect dependencies at com.oracle:ojdbc6:jar:11.2.0.1.0: Failed
to read artifact descriptor for com.oracle:ojdbc6:jar:11.2.0.1.0: Could not
transfer artifact com.oracle:ojdbc6:pom:11.2.0.1.0 from/to uber-artifactory
(http://artifactory.uber.internal:4587/artifactory/repo/): Connect to
artifactory.uber.internal:4587 [artifactory.uber.internal/10.162.11.61]
failed: Operation timed out -> [Help 1]


RE: Spark standalone workers, executors and JVMs

2016-05-02 Thread Mohammed Guller
The workers and executors run as separate JVM processes in the standalone mode.

The use of multiple workers on a single machine depends on how you will be 
using the cluster. If you run multiple Spark applications simultaneously, each 
application gets its own executors. So, for example, if you allocate 8GB to each 
application, you can run 192/8 = 24 Spark applications simultaneously (assuming 
you also have a large number of cores). Each executor has only an 8GB heap, so GC 
should not be an issue. Alternatively, if you know that you will have only a few 
applications running simultaneously on that cluster, running multiple workers 
on each machine will allow you to avoid the GC issues associated with allocating 
a large heap to a single JVM process. This option allows you to run multiple 
executors for an application on a single machine, and each executor can be 
configured with optimal memory.
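
As a concrete illustration, a minimal sketch (hypothetical sizes) of asking standalone
mode for several moderately sized executors on a 192 GB node instead of one huge JVM,
using the standard spark.executor.* properties:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.memory", "24g")   // heap per executor JVM, keeps GC pauses manageable
  .set("spark.executor.cores", "6")      // cores per executor; allows several executors per worker
  .set("spark.cores.max", "48")          // cap the total cores this application takes from the cluster
val sc = new SparkContext(conf)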


Mohammed
Author: Big Data Analytics with 
Spark

From: Simone Franzini [mailto:captainfr...@gmail.com]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in 
standalone mode.
Are worker processes and executors independent JVMs or do executors run within 
the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying 
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers 
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker 
(--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini



Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hello again. I searched for "backport kafka" in the list archives but
couldn't find anything but a post from Spark 0.7.2. I was going to
use accumulators to make a counter, but then saw the Receiver Statistics
on the Streaming tab. Then I removed all other "functionality"
except:


JavaPairReceiverInputDStream<byte[], byte[]> dstream = KafkaUtils
  // createStream(JavaStreamingContext jssc, Class<K> keyTypeClass, Class<V> valueTypeClass,
  //              Class<U> keyDecoderClass, Class<T> valueDecoderClass,
  //              java.util.Map<String, String> kafkaParams, java.util.Map<String, Integer> topics,
  //              StorageLevel storageLevel)
  .createStream(jssc, byte[].class, byte[].class,
      kafka.serializer.DefaultDecoder.class, kafka.serializer.DefaultDecoder.class,
      kafkaParamsMap, topicMap, StorageLevel.MEMORY_AND_DISK_SER());

   dstream.print();

Then in the Receiver Stats for the single receiver, I'm seeing around
380 records / second. Then to get anywhere near my 10% mentioned
above, I'd need to run around 21 receivers, assuming 380 records /
second, just using the print output. This seems awfully high to me,
considering that I wrote 8+ records a second to Kafka from a
mapreduce job, and that my bottleneck was likely Hbase. Again using
the 380 estimate, I would need 200+ receivers to reach a similar
amount of reads.

Even given the issues with the 1.2 receivers, is this the expected way
to use the Kafka streaming API, or am I doing something terribly
wrong?

My application looks like
https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
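
If a single receiver is the limit, one commonly suggested workaround for the
receiver-based API is to create several input streams and union them. A sketch
only: jssc, kafkaParamsMap and topicMap are the ones from the gist, and the
receiver count is a guess.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Sketch: several receiver-based streams, unioned, to parallelize ingestion.
int numReceivers = 4;  // hypothetical; bounded by available executor cores
List<JavaPairDStream<byte[], byte[]>> streams =
    new ArrayList<JavaPairDStream<byte[], byte[]>>();
for (int i = 0; i < numReceivers; i++) {
  streams.add(KafkaUtils.createStream(jssc, byte[].class, byte[].class,
      kafka.serializer.DefaultDecoder.class, kafka.serializer.DefaultDecoder.class,
      kafkaParamsMap, topicMap, StorageLevel.MEMORY_AND_DISK_SER()));
}
JavaPairDStream<byte[], byte[]> unioned =
    jssc.union(streams.get(0), streams.subList(1, streams.size()));
unioned.print();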

On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
> Have you tested for read throughput (without writing to hbase, just
> deserialize)?
>
> Are you limited to using spark 1.2, or is upgrading possible?  The
> kafka direct stream is available starting with 1.3.  If you're stuck
> on 1.2, I believe there have been some attempts to backport it, search
> the mailing list archives.
>
> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
> wrote:
>> I've written an application to get content from a kafka topic with 1.7
>> billion entries,  get the protobuf serialized entries, and insert into
>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>
>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>> 0-2500 writes / second. This will take much too long to consume the
>> entries.
>>
>> I currently believe that the spark kafka receiver is the bottleneck.
>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>> notice any large performance difference. I've tried many different
>> spark configuration options, but can't seem to get better performance.
>>
>> I saw 8 requests / second inserting these records into kafka using
>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>
>> While hbase inserts might not deliver the same throughput, I'd like to
>> at least get 10%.
>>
>> My application looks like
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> This is my first spark application. I'd appreciate any assistance.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Redirect from yarn to spark history server

2016-05-02 Thread Marcelo Vanzin
See http://spark.apache.org/docs/latest/running-on-yarn.html,
especially the parts that talk about
spark.yarn.historyServer.address.
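
For example, something along these lines in spark-defaults.conf (host, port and
log directory are placeholders) makes the YARN UI link through to the history
server:

spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-history
spark.history.fs.logDirectory     hdfs:///spark-history
spark.yarn.historyServer.address  historyserver-host:18080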

On Mon, May 2, 2016 at 2:14 PM, satish saley  wrote:
>
>
> Hello,
>
> I am running a pyspark job using yarn-cluster mode. I can see the Spark job in
> YARN, but I am not able to get from the "log history" link in YARN to the Spark
> history server. How would I keep track of a YARN log and its corresponding log
> in the Spark history server? Is there any setting in YARN/Spark that lets me
> redirect to the Spark history server from YARN?
>
> Best,
> Satish



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Redirect from yarn to spark history server

2016-05-02 Thread satish saley
Hello,

I am running a pyspark job using yarn-cluster mode. I can see the Spark job in
YARN, but I am not able to get from the "log history" link in YARN to the Spark
history server. How would I keep track of a YARN log and its corresponding
log in the Spark history server? Is there any setting in YARN/Spark that lets me
redirect to the Spark history server from YARN?

Best,
Satish


Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong,

Sorry, let me explain my deduction; it is going to be difficult to get sample
data out since the dataset I am using is proprietary.

From the above set of queries (the ones mentioned in the comments above), both
the inner and the outer joins are producing the same counts.  They are basically
pulling out selected columns from the query, but there is no roll-up happening or
anything that would possibly make it suspicious that there is any
difference besides the type of join.  The tables are matched on
three keys that are present in both tables (ad, account, and date); on top
of this they are filtered by date being on or after 2016-01-03.  Since all the
joins are producing the same counts, the natural suspicion is that the
tables are identical, but when I run the following two queries:

scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>='2016-01-03'").count

res14: Long = 34158

scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>='2016-01-03'").count

res15: Long = 42693


The above two queries apply the same date filter (>= 2016-01-03) that the joins
use, and you can see that the row counts of the two tables are different. That
is why I suspect something is wrong with the outer joins in Spark SQL: in this
situation the right and full outer joins might well produce the same result, but
it should not be equal to the left join, and definitely not to the inner join,
unless I am missing something.
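
One thing that could explain matching counts without a bug: a WHERE predicate on
the null-extended side (here d.date >= '2016-01-03') is never true for the null
rows that an outer join adds, so those rows are dropped and all three joins end
up returning the inner-join count. A sketch of a comparison that keeps the
outer-join semantics, reusing the tables and columns above, is to filter each
side before joining:

// Sketch: filter first (or move the d.* predicate into the ON clause) so the
// null-extended rows are not thrown away by the WHERE clause.
val s = sqlContext.sql("SELECT * FROM swig_pin_promo_lt WHERE date >= '2016-01-03'")
val d = sqlContext.sql("SELECT * FROM dps_pin_promo_lt WHERE date >= '2016-01-03'")
s.registerTempTable("s_filtered")
d.registerTempTable("d_filtered")
sqlContext.sql(
  "SELECT s.date, s.account, s.ad, s.spend, d.spend_in_dollar " +
  "FROM s_filtered s FULL OUTER JOIN d_filtered d " +
  "ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)").count()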


Side note: in my hasty response above I posted the wrong count for
dps.count; the real value is res16: Long = 42694.


Thanks,


KP



On Mon, May 2, 2016 at 12:50 PM, Yong Zhang  wrote:

> We are still not sure what is the problem, if you cannot show us with some
> example data.
>
> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
> with swig on 3 columns; with additional filters, get the same resultSet row
> count as dps lefter outer join with swig on 3 columns, with additional
> filters, again get the the same resultSet row count as dps right outer join
> with swig on 3 columns, with same additional filters.
>
> Without knowing your data, I cannot see the reason that has to be a bug in
> the spark.
>
> Am I misunderstanding your bug?
>
> Yong
>
> --
> From: kpe...@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengu...@gmail.com
> CC: user@spark.apache.org
>
>
> Gourav,
>
> I wish that was case, but I have done a select count on each of the two
> tables individually and they return back different number of rows:
>
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
>
> dps.count()
> RESULT: 42632
>
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
>
> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


Re: Number of executors change during job running

2016-05-02 Thread Vikash Pareek
Hi Bill,

You can try the direct stream and increase the number of partitions on the Kafka
topic; the input DStream will then have one partition per Kafka partition,
without any re-partitioning.
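
A minimal sketch of the direct approach (broker address and topic name are made
up; ssc is an existing StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: the resulting stream gets one RDD partition per Kafka partition of "mytopic".
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))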

Can you please share your event timeline chart from the Spark UI? You need to
tune your configuration to match the computation, and the Spark UI will give a
deeper understanding of the problem.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-executors-change-during-job-running-tp9243p26866.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



QueryExecution to String breaks with OOM

2016-05-02 Thread Brandon White
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
        at java.lang.StringBuilder.append(StringBuilder.java:132)
        at scala.StringContext.standardInterpolator(StringContext.scala:123)
        at scala.StringContext.s(StringContext.scala:90)
        at org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:947)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
        at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)

I have a dataframe that I am running a ton of filters on. When I try to
save it, my job runs out of memory.

Any idea how I can fix this?
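
One possible mitigation, assuming (from the trace above, which dies while
building the plan string in QueryExecution.toString) that the sheer number of
Filter nodes makes the query plan huge: fold the predicates into one filter
before writing, so the plan stays small. A sketch, where df and the conditions
stand in for the real dataframe and filters:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Sketch: collapse a long chain of .filter(...) calls into a single predicate.
val conditions: Seq[Column] = Seq(col("a") > 0, col("b") === "x")  // hypothetical predicates
val combined = conditions.reduce(_ && _)
df.filter(combined).write.parquet("/tmp/out")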


RE: Weird results with Spark SQL Outer joins

2016-05-02 Thread Yong Zhang
We are still not sure what the problem is, if you cannot show us some 
example data.
For dps with 42632 rows and swig with 42034 rows: dps full outer join with swig 
on 3 columns, with additional filters, gets the same result-set row count as dps 
left outer join with swig on 3 columns with the same additional filters, which 
again is the same result-set row count as dps right outer join with swig on 3 
columns with the same additional filters.
Without knowing your data, I cannot see why that has to be a bug in Spark.
Am I misunderstanding your bug?
Yong

From: kpe...@gmail.com
Date: Mon, 2 May 2016 12:11:18 -0700
Subject: Re: Weird results with Spark SQL Outer joins
To: gourav.sengu...@gmail.com
CC: user@spark.apache.org

Gourav,
I wish that was case, but I have done a select count on each of the two tables 
individually and they return back different number of rows:









dps.registerTempTable("dps_pin_promo_lt")

swig.registerTempTable("swig_pin_promo_lt")




dps.count()

RESULT: 42632




swig.count()

RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta  
wrote:
This shows that both the tables have matching records and no mismatches. 
Therefore obviously you have the same results irrespective of whether you use 
right or left join. 
I think that there is no problem here, unless I am missing something.
Regards,
Gourav
On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
Also, the results of the inner query produced the same results:

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS

d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,

d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN

dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =

d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()

RESULT:23747







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org






  

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav,

I wish that was the case, but I have done a select count on each of the two
tables individually and they return back different number of rows:


dps.registerTempTable("dps_pin_promo_lt")

swig.registerTempTable("swig_pin_promo_lt")


dps.count()

RESULT: 42632


swig.count()

RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta 
wrote:

> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


how to orderBy previous groupBy.count.orderBy in pyspark

2016-05-02 Thread webe3vt
I have the following simple example that I can't get to work correctly.

In [1]:

from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import asc, desc, sum, count

sqlContext = SQLContext(sc)

error_schema = StructType([
    StructField('id', IntegerType(), nullable=False),
    StructField('error_code', IntegerType(), nullable=False),
    StructField('error_desc', StringType(), nullable=False)
])
error_data = sc.parallelize([
    Row(1, 1, 'type 1 error'),
    Row(1, 2, 'type 2 error'),
    Row(2, 4, 'type 4 error'),
    Row(2, 3, 'type 3 error'),
    Row(2, 3, 'type 3 error'),
    Row(2, 2, 'type 2 error'),
    Row(2, 1, 'type 1 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 3, 'type 3 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 4, 'type 4 error'),
    Row(3, 5, 'type 5 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 1, 'type 1 error'),
    Row(3, 2, 'type 2 error'),
    Row(3, 4, 'type 4 error'),
    Row(3, 1, 'type 1 error'),
])
error_df = sqlContext.createDataFrame(error_data, error_schema)
error_df.show()
id_count = error_df.groupBy(error_df["id"]).count().orderBy(desc("count"))
id_count.show()
error_df.groupBy(error_df["id"], error_df["error_code"],
                 error_df["error_desc"]).count().orderBy(id_count["id"],
                                                         desc("count")).show(20)

+---+--++
| id|error_code|  error_desc|
+---+--++
|  1| 1|type 1 error|
|  1| 2|type 2 error|
|  2| 4|type 4 error|
|  2| 3|type 3 error|
|  2| 3|type 3 error|
|  2| 2|type 2 error|
|  2| 1|type 1 error|
|  3| 2|type 2 error|
|  3| 2|type 2 error|
|  3| 2|type 2 error|
|  3| 1|type 1 error|
|  3| 3|type 3 error|
|  3| 1|type 1 error|
|  3| 1|type 1 error|
|  3| 4|type 4 error|
|  3| 5|type 5 error|
|  3| 1|type 1 error|
|  3| 1|type 1 error|
|  3| 2|type 2 error|
|  3| 4|type 4 error|
+---+--++
only showing top 20 rows

+---+-+
| id|count|
+---+-+
|  3|   14|
|  2|5|
|  1|2|
+---+-+

+---+--++-+
| id|error_code|  error_desc|count|
+---+--++-+
|  1| 1|type 1 error|1|
|  1| 2|type 2 error|1|
|  2| 3|type 3 error|2|
|  2| 2|type 2 error|1|
|  2| 1|type 1 error|1|
|  2| 4|type 4 error|1|
|  3| 1|type 1 error|6|
|  3| 2|type 2 error|4|
|  3| 4|type 4 error|2|
|  3| 3|type 3 error|1|
|  3| 5|type 5 error|1|
+---+--++-+



What I would like is to end up with that last table ordered by the ids
that have the largest error count and within each id descending by
count.  I would like the end result to be like this.

+---+--++-+
| id|error_code|  error_desc|count|
+---+--++-+
|  3| 1|type 1 error|6|
|  3| 2|type 2 error|4|
|  3| 4|type 4 error|2|
|  3| 3|type 3 error|1|
|  3| 5|type 5 error|1|
|  2| 3|type 3 error|2|
|  2| 2|type 2 error|1|
|  2| 1|type 1 error|1|
|  2| 4|type 4 error|1|
|  1| 1|type 1 error|1|
|  1| 2|type 2 error|1|
+---+--++-+

Because id 3 has the highest error count, id 2 the next highest, and id 1 the
lowest error count.

What is the best way to do this?
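
A sketch of one way to get that ordering (using the error_df defined above, and
assuming Spark 1.4+ for the string-based join and drop): join the per-id totals
back onto the per-code counts and sort by the total first.

from pyspark.sql.functions import desc

# Sketch: ids with the largest overall error count come first, then codes by count.
id_totals = error_df.groupBy("id").count().withColumnRenamed("count", "id_total")
per_code = error_df.groupBy("id", "error_code", "error_desc").count()

result = (per_code.join(id_totals, "id")
                  .orderBy(desc("id_total"), desc("count"))
                  .drop("id_total"))
result.show(20)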



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-orderBy-previous-groupBy-count-orderBy-in-pyspark-tp26864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
This shows that both the tables have matching records and no mismatches.
Therefore obviously you have the same results irrespective of whether you
use right or left join.

I think that there is no problem here, unless I am missing something.

Regards,
Gourav

On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:

> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread kpeng1
Also, the inner join query produced the same result:
sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count() 
RESULT:23747 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Adding back user@spark.

Since the top of the stack trace is in Datastax class(es), I suggest posting on
their mailing list.

On Mon, May 2, 2016 at 11:29 AM, Piyush Verma 
wrote:

> Hmm, weird. They show up on the web interface.
>
> Wait, got it. It's wrapped up inside the < raw >..< /raw > so text-only
> mail clients prune what’s inside.
> Anyway here’s the text again. (Inline)
>
> > On 02-May-2016, at 23:56, Ted Yu  wrote:
> >
> > Maybe you were trying to embed pictures for the error and your code -
> but they didn't go through.
> >
> > On Mon, May 2, 2016 at 10:32 AM, meson10  wrote:
> > Hi,
> >
> > I am trying to save a RDD to Cassandra but I am running into the
> following
> > error:
>
> [{'key': 3, 'value': 'foobar'}]
>
> [Stage 9:>  (0 +
> 2) / 2]
> [Stage 9:=> (1 +
> 1) / 2]WARN  2016-05-02 17:23:55,240
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 9.0 (TID
> 11, 10.0.6.200): java.lang.NullPointerException
> at com.datastax.bdp.spark.python.RDDPythonFunctions.com
> $datastax$bdp$spark$python$RDDPythonFunctions$$toCassandraRow(RDDPythonFunctions.scala:57)
> at
> com.datastax.bdp.spark.python.RDDPythonFunctions$$anonfun$toCassandraRows$1.apply(RDDPythonFunctions.scala:73)
> at
> com.datastax.bdp.spark.python.RDDPythonFunctions$$anonfun$toCassandraRows$1.apply(RDDPythonFunctions.scala:73)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at
> com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
> at
> com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
> at
> com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
> com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
> at
> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:155)
> at
> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:139)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
> at
> com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
> at
> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
> at
> com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:139)
> at
> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
> at
> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> ERROR 2016-05-02 17:23:55,406 org.apache.spark.scheduler.TaskSetManager:
> Task 1 in stage 9.0 failed 4 times; aborting job
> Traceback (most recent call last):
>   File "/home/ubuntu/test-spark.py", line 50, in 
> main()
>   File "/home/ubuntu/test-spark.py", line 47, in main
> runner.run()
>   File "/home/ubuntu/spark_common.py", line 62, in run
> self.save_logs_to_cassandra()
>   File "/home/ubuntu/spark_common.py", line 142, in save_logs_to_cassandra
> rdd.saveToCassandra(keyspace, tablename)
>   File "/usr/share/dse/spark/python/lib/pyspark.zip/pyspark/rdd.py", line
> 2313, in saveToCassandra
>   File
> "/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o149.saveToCassandra.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1 in stage 9.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 9.0 (TID 14, 10.0.6.200): java.lang.NullPointerException
> at com.datastax.bdp.spark.python.RDDPythonFunctions.com
> $datastax$bdp$spark$python$RDDPythonFunctions$$toCassandraRow(RDDPythonFunctions.scala:57)
> at
> 

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi Kevin,

Thanks.

Please post the result of the same query with INNER JOIN and then it will
give us a bit of insight.

Regards,
Gourav


On Mon, May 2, 2016 at 7:10 PM, Kevin Peng  wrote:

> Gourav,
>
> Apologies.  I edited my post with this information:
> Spark version: 1.6
> Result from spark shell
> OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
> mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014
>
> Thanks,
>
> KP
>
> On Mon, May 2, 2016 at 11:05 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> As always, can you please write down details regarding your SPARK cluster
>> - the version, OS, IDE used, etc?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, May 2, 2016 at 5:58 PM, kpeng1  wrote:
>>
>>> Hi All,
>>>
>>> I am running into a weird result with Spark SQL Outer joins.  The results
>>> for all of them seem to be the same, which does not make sense due to the
>>> data.  Here are the queries that I am running with the results:
>>>
>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account AS
>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad =
>>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>>> '2016-01-03'").count()
>>> RESULT:23747
>>>
>>>
>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account AS
>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad =
>>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>>> '2016-01-03'").count()
>>> RESULT:23747
>>>
>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account AS
>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad =
>>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>>> '2016-01-03'").count()
>>> RESULT: 23747
>>>
>>> Was wondering if someone had encountered this issues before.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Maybe you were trying to embed pictures for the error and your code - but
they didn't go through.

On Mon, May 2, 2016 at 10:32 AM, meson10  wrote:

> Hi,
>
> I am trying to save a RDD to Cassandra but I am running into the following
> error:
>
>
>
> The Python code looks like this:
>
>
> I am using DSE 4.8.6 which runs Spark 1.4.2
>
> I ran through a bunch of existing posts on this mailing lists and have
> already performed the following routines:
>
>  * Ensure that there is no redundant cassandra .jar lying around,
> interfering with the process.
>  * Wiped clean and reinstall DSE to ensure that.
>  * Tried Loading data from Cassandra to ensure that Spark <-> Cassandra
> communication is working. I usedprint
> self.context.cassandraTable(keyspace='test', table='dummy').collect() to
> validate that.
>  * Ensure there are no null values in my dataset that is being written.
>  * The namespace and the table exist in Cassandra using cassandra@cqlsh>
> SELECT * from test.dummy ;
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-while-performing-rdd-SaveToCassandra-tp26862.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
Jorn,

what aspects are you speaking about?

My response was absolutely pertinent to Jinan because he will not even face
the problem if he used Scala. So it was along the lines of teaching a person
to fish rather than giving him a fish.

And by the way, your blinkered and biased response missed the fact that
SPARK WAS WRITTEN AND IS WRITTEN IN SCALA.

Regards,
Gourav

On Mon, May 2, 2016 at 5:14 PM, Jörn Franke  wrote:

> You See oversimplifying here and some of your statements are not correct.
> There are also other aspects to consider. Finally, it would be better to
> support him with the problem, because Spark supports Java. Java and Scala
> run on the same underlying JVM.
>
> On 02 May 2016, at 17:42, Gourav Sengupta 
> wrote:
>
> JAVA does not easily parallelize, JAVA is verbose, uses different classes
> for serializing, and on top of that you are using RDD's instead of
> dataframes.
>
> Should a senior project not have an implied understanding that it should
> be technically superior?
>
> Why not use SCALA?
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 3:37 PM, Jinan Alhajjaj 
> wrote:
>
>>
>> Because I am doing this project for my senior project by using Java.
>> I try s3a URI as this:
>>
>> s3a://accessId:secret@bucket/path
>>
>> It show me an error :
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
>>
>> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(
>> S3AFileSystem.java:287)
>>
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>>
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>>
>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630
>> )
>>
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>>
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>>
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>
>> at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(
>> FileInputFormat.java:256)
>>
>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(
>> FileInputFormat.java:228)
>>
>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(
>> FileInputFormat.java:313)
>>
>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>
>> at scala.Option.getOrElse(Option.scala:120)
>>
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>
>> at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
>>
>> at
>> org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:526)
>> --
>> Date: Thu, 28 Apr 2016 11:19:08 +0100
>> Subject: Re: Reading from Amazon S3
>> From: gourav.sengu...@gmail.com
>> To: ste...@hortonworks.com
>> CC: yuzhih...@gmail.com; j.r.alhaj...@hotmail.com; user@spark.apache.org
>>
>>
>> Why would you use JAVA (create a problem and then try to solve it)? Have
>> you tried 

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi Cody,

  I'm going to use an accumulator right now to get an idea of the
throughput. Thanks for mentioning the backported module. Also it
looks like I missed this section:
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch
from the docs. Then maybe I should try creating multiple streams to
get more throughput?

Thanks,

Colin Williams

On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger  wrote:
> Have you tested for read throughput (without writing to hbase, just
> deserialize)?
>
> Are you limited to using spark 1.2, or is upgrading possible?  The
> kafka direct stream is available starting with 1.3.  If you're stuck
> on 1.2, I believe there have been some attempts to backport it, search
> the mailing list archives.
>
> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  
> wrote:
>> I've written an application to get content from a kafka topic with 1.7
>> billion entries,  get the protobuf serialized entries, and insert into
>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>
>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>> 0-2500 writes / second. This will take much too long to consume the
>> entries.
>>
>> I currently believe that the spark kafka receiver is the bottleneck.
>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>> notice any large performance difference. I've tried many different
>> spark configuration options, but can't seem to get better performance.
>>
>> I saw 8 requests / second inserting these records into kafka using
>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>
>> While hbase inserts might not deliver the same throughput, I'd like to
>> at least get 10%.
>>
>> My application looks like
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> This is my first spark application. I'd appreciate any assistance.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi David,

 My current concern is that I'm using a spark hbase bulk put driver
written for Spark 1.2 on the version of CDH my spark / yarn cluster is
running on. Even if I were to run on another Spark cluster, I'm
concerned that I might have issues making the put requests into hbase.
However I should give it a shot if I abandon Spark 1.2, and my current
environment.

Thanks,

Colin Williams

On Mon, May 2, 2016 at 6:06 PM, Krieg, David
 wrote:
> Spark 1.2 is a little old and busted. I think most of the advice you'll get is
> to try to use Spark 1.3 at least, which introduced a new Spark streaming mode
> (direct receiver). The 1.2 Receiver based implementation had a number of
> shortcomings. 1.3 is where the "direct streaming" interface was introduced,
> which is what we use. You'll get more joy the more you upgrade Spark, at least
> to some extent.
>
> David Krieg | Enterprise Software Engineer
> Early Warning
> Direct: 480.426.2171 | Fax: 480.483.4628 | Mobile: 859.227.6173
>
>
> -Original Message-
> From: Colin Kincaid Williams [mailto:disc...@uw.edu]
> Sent: Monday, May 02, 2016 10:55 AM
> To: user@spark.apache.org
> Subject: Improving performance of a kafka spark streaming app
>
> I've written an application to get content from a kafka topic with 1.7 billion
> entries,  get the protobuf serialized entries, and insert into hbase.
> Currently the environment that I'm running in is Spark 1.2.
>
> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
> 0-2500 writes / second. This will take much too long to consume the entries.
>
> I currently believe that the spark kafka receiver is the bottleneck.
> I've tried both 1.2 receivers, with the WAL and without, and didn't notice any
> large performance difference. I've tried many different spark configuration
> options, but can't seem to get better performance.
>
> I saw 8 requests / second inserting these records into kafka using yarn /
> hbase / protobuf / kafka in a bulk fashion.
>
> While hbase inserts might not deliver the same throughput, I'd like to at
> least get 10%.
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> This is my first spark application. I'd appreciate any assistance.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Reading from Amazon S3

2016-05-02 Thread Jörn Franke
You are oversimplifying here and some of your statements are not correct. There 
are also other aspects to consider. Finally, it would be better to support him 
with the problem, because Spark supports Java. Java and Scala run on the same 
underlying JVM.

> On 02 May 2016, at 17:42, Gourav Sengupta  wrote:
> 
> JAVA does not easily parallelize, JAVA is verbose, uses different classes for 
> serializing, and on top of that you are using RDD's instead of dataframes. 
> 
> Should a senior project not have an implied understanding that it should be 
> technically superior?
> 
> Why not use SCALA?
> 
> Regards,
> Gourav
> 
>> On Mon, May 2, 2016 at 3:37 PM, Jinan Alhajjaj  
>> wrote:
>> 
>> Because I am doing this project for my senior project by using Java.
>> I try s3a URI as this:
>>  
>> s3a://accessId:secret@bucket/path
>> 
>> It show me an error :
>> Exception in thread "main" java.lang.NoSuchMethodError: 
>> com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
>> 
>>  at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
>> 
>>  at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>> 
>>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>> 
>>  at 
>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>> 
>>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>> 
>>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>> 
>>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>> 
>>  at 
>> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
>> 
>>  at 
>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
>> 
>>  at 
>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>> 
>>  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> 
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> 
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> 
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> 
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> 
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> 
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> 
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> 
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> 
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> 
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> 
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> 
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> 
>>  at 
>> org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
>> 
>>  at 
>> org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:526)
>> 
>> Date: Thu, 28 Apr 2016 11:19:08 +0100
>> Subject: Re: Reading from Amazon S3
>> From: gourav.sengu...@gmail.com
>> To: ste...@hortonworks.com
>> CC: yuzhih...@gmail.com; j.r.alhaj...@hotmail.com; user@spark.apache.org
>> 
>> 
>> Why would you use JAVA (create a problem and then try to solve it)? Have you 
>> tried using Scala or Python or even R?
>> 
>> Regards,
>> Gourav 
>> 
>> On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran  
>> wrote:
>> 
>> On 26 Apr 2016, at 18:49, Ted Yu 

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav,

Apologies.  I edited my post with this information:
Spark version: 1.6
Result from spark shell
OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014

Thanks,

KP

On Mon, May 2, 2016 at 11:05 AM, Gourav Sengupta 
wrote:

> Hi,
>
> As always, can you please write down details regarding your SPARK cluster
> - the version, OS, IDE used, etc?
>
> Regards,
> Gourav Sengupta
>
> On Mon, May 2, 2016 at 5:58 PM, kpeng1  wrote:
>
>> Hi All,
>>
>> I am running into a weird result with Spark SQL Outer joins.  The results
>> for all of them seem to be the same, which does not make sense due to the
>> data.  Here are the queries that I am running with the results:
>>
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT: 23747
>>
>> Was wondering if someone had encountered this issues before.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just
deserialize)?

Are you limited to using spark 1.2, or is upgrading possible?  The
kafka direct stream is available starting with 1.3.  If you're stuck
on 1.2, I believe there have been some attempts to backport it, search
the mailing list archives.
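
A minimal sketch of such a read-throughput test, in the Java API used in the
gist (dstream is the receiver stream from the gist, with nothing else in the
path):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// Sketch: only count each batch, to measure the raw receive rate.
dstream.foreachRDD(new Function<JavaPairRDD<byte[], byte[]>, Void>() {
  @Override
  public Void call(JavaPairRDD<byte[], byte[]> rdd) {
    System.out.println("records this batch: " + rdd.count());
    return null;
  }
});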

On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams  wrote:
> I've written an application to get content from a kafka topic with 1.7
> billion entries,  get the protobuf serialized entries, and insert into
> hbase. Currently the environment that I'm running in is Spark 1.2.
>
> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
> 0-2500 writes / second. This will take much too long to consume the
> entries.
>
> I currently believe that the spark kafka receiver is the bottleneck.
> I've tried both 1.2 receivers, with the WAL and without, and didn't
> notice any large performance difference. I've tried many different
> spark configuration options, but can't seem to get better performance.
>
> I saw 8 requests / second inserting these records into kafka using
> yarn / hbase / protobuf / kafka in a bulk fashion.
>
> While hbase inserts might not deliver the same throughput, I'd like to
> at least get 10%.
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> This is my first spark application. I'd appreciate any assistance.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL with large result size

2016-05-02 Thread Gourav Sengupta
Hi,

I have worked on 300GB of data by querying it from CSV (using SPARK CSV),
writing it to Parquet format, and then querying the Parquet data to partition
it and write out individual CSV files, without any issues on a single-node
SPARK cluster installation.

Are you trying to cache the entire dataset? What is it that you are trying to
achieve in your use case?

Regards,
Gourav

On Mon, May 2, 2016 at 5:59 PM, Ted Yu  wrote:

> That's my interpretation.
>
> On Mon, May 2, 2016 at 9:45 AM, Buntu Dev  wrote:
>
>> Thanks Ted, I thought the avg. block size was already low and less than
>> the usual 128mb. If I need to reduce it further via parquet.block.size, it
>> would mean an increase in the number of blocks and that should increase the
>> number of tasks/executors. Is that the correct way to interpret this?
>>
>> On Mon, May 2, 2016 at 6:21 AM, Ted Yu  wrote:
>>
>>> Please consider decreasing block size.
>>>
>>> Thanks
>>>
>>> > On May 1, 2016, at 9:19 PM, Buntu Dev  wrote:
>>> >
>>> > I got a 10g limitation on the executors and operating on parquet
>>> dataset with block size 70M with 200 blocks. I keep hitting the memory
>>> limits when doing a 'select * from t1 order by c1 limit 100' (ie, 1M).
>>> It works if I limit to say 100k. What are the options to save a large
>>> dataset without running into memory issues?
>>> >
>>> > Thanks!
>>>
>>
>>
>


Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi,

As always, can you please write down details regarding your SPARK cluster -
the version, OS, IDE used, etc?

Regards,
Gourav Sengupta

On Mon, May 2, 2016 at 5:58 PM, kpeng1  wrote:

> Hi All,
>
> I am running into a weird result with Spark SQL Outer joins.  The results
> for all of them seem to be the same, which does not make sense due to the
> data.  Here are the queries that I am running with the results:
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT: 23747
>
> Was wondering if someone had encountered this issues before.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
I've written an application to get content from a kafka topic with 1.7
billion entries,  get the protobuf serialized entries, and insert into
hbase. Currently the environment that I'm running in is Spark 1.2.

With 8 executors and 2 cores, and 2 jobs, I'm only getting between
0-2500 writes / second. This will take much too long to consume the
entries.

I currently believe that the spark kafka receiver is the bottleneck.
I've tried both 1.2 receivers, with the WAL and without, and didn't
notice any large performance difference. I've tried many different
spark configuration options, but can't seem to get better performance.

I saw 8 requests / second inserting these records into kafka using
yarn / hbase / protobuf / kafka in a bulk fashion.

While hbase inserts might not deliver the same throughput, I'd like to
at least get 10%.

My application looks like
https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

This is my first spark application. I'd appreciate any assistance.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread meson10
Hi,

I am trying to save a RDD to Cassandra but I am running into the following
error:



The Python code looks like this:


I am using DSE 4.8.6 which runs Spark 1.4.2

I ran through a bunch of existing posts on this mailing list and have
already performed the following routines:

 * Ensure that there is no redundant cassandra .jar lying around,
interfering with the process.
 * Wiped clean and reinstall DSE to ensure that.
 * Tried loading data from Cassandra to ensure that Spark <-> Cassandra
communication is working. I used print
self.context.cassandraTable(keyspace='test', table='dummy').collect() to
validate that.
 * Ensure there are no null values in my dataset that is being written.
 * The namespace and the table exist in Cassandra using cassandra@cqlsh>
SELECT * from test.dummy ;



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-while-performing-rdd-SaveToCassandra-tp26862.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Weird results with Spark SQL Outer joins

2016-05-02 Thread kpeng1
Hi All,

I am running into a weird result with Spark SQL Outer joins.  The results
for all of them seem to be the same, which does not make sense due to the
data.  Here are the queries that I am running with the results:

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
RESULT:23747


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
RESULT:23747

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
RESULT: 23747

Was wondering if someone had encountered this issue before.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: kafka direct streaming python API fromOffsets

2016-05-02 Thread Cody Koeninger
If you're confused about the type of an argument, you're probably
better off looking at documentation that includes static types:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$

createDirectStream's fromOffsets parameter takes a map from
TopicAndPartition to Long.

There is documentation for a python constructor for TopicAndPartition:

http://spark.apache.org/docs/latest/api/python/_modules/pyspark/streaming/kafka.html#TopicAndPartition
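
For illustration, a minimal PySpark sketch (the topic name, partition
numbers, offsets and broker address below are only placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = SparkContext(appName="direct-stream-from-offsets")
ssc = StreamingContext(sc, 15)

# keys are TopicAndPartition objects, values are plain integer offsets
fromOffsets = {
    TopicAndPartition("mytopic", 0): 123,
    TopicAndPartition("mytopic", 1): 234,
}

stream = KafkaUtils.createDirectStream(
    ssc,
    ["mytopic"],
    {"metadata.broker.list": "localhost:9092"},
    fromOffsets=fromOffsets)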


On Mon, May 2, 2016 at 5:54 AM, Tigran Avanesov
 wrote:
> Hi,
>
> I'm trying to start consuming messages from a kafka topic (via direct
> stream) from a given offset.
> The documentation of createDirectStream says:
>
> :param fromOffsets: Per-topic/partition Kafka offsets defining the
> (inclusive) starting
> point of the stream.
>
> However it expects a dictionary of topics (not names...), as i tried to feed
> it something like { 'topic' : {0: 123, 1:234}}, and of course got an
> exception.
> How should I build this fromOffsets parameter?
>
> Documentation does not say anything about it.
> (In general, I think it would be better if the function accepted topic
> names)
>
> Thank you!
>
> Regards,
> Tigran
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL with large result size

2016-05-02 Thread Buntu Dev
Thanks Ted, I thought the avg. block size was already low and less than the
usual 128MB. If I need to reduce it further via parquet.block.size, it
would mean an increase in the number of blocks, and that should increase the
number of tasks/executors. Is that the correct way to interpret this?

On Mon, May 2, 2016 at 6:21 AM, Ted Yu  wrote:

> Please consider decreasing block size.
>
> Thanks
>
> > On May 1, 2016, at 9:19 PM, Buntu Dev  wrote:
> >
> > I got a 10g limitation on the executors and operating on parquet dataset
> with block size 70M with 200 blocks. I keep hitting the memory limits when
> doing a 'select * from t1 order by c1 limit 100' (ie, 1M). It works if
> I limit to say 100k. What are the options to save a large dataset without
> running into memory issues?
> >
> > Thanks!
>


Fwd: Spark standalone workers, executors and JVMs

2016-05-02 Thread Simone Franzini
I am still a little bit confused about workers, executors and JVMs in
standalone mode.
Are worker processes and executors independent JVMs or do executors run
within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
worker (--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


zero-length input partitions from parquet

2016-05-02 Thread Han JU
Hi,

I just found out that we can have lots of empty input partitions when
reading from parquet files.

Sample code as follows:

  val hconf = sc.hadoopConfiguration
  val job = new Job(hconf)

  FileInputFormat.setInputPaths(job, new Path("path_to_data"))
  ParquetInputFormat.setReadSupportClass(job,
    classOf[AvroReadSupport[MyAvroType]])
  val rdd = new NewHadoopRDD[Void, MyAvroType](
    sc,
    classOf[ParquetInputFormat[MyAvroType]],
    classOf[Void],
    classOf[MyAvroType],
    job.getConfiguration
  )

  val ctx = rdd.newJobContext(job.getConfiguration, new JobID())
  val inputFormat = new ParquetInputFormat[MyAvroType]()

  inputFormat.getSplits(ctx).asScala.foreach(println)

  val sizes = rdd.mapPartitions { iter =>
    List(iter.size).iterator
  }.collect().toList
  sizes.foreach(println)


The splits are ok:

ParquetInputSplit{part: file:/folder/test_file start: 0 end: 33554432
length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 33554432 end:
67108864 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 67108864 end:
100663296 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 100663296 end:
106022166 length: 5358870 hosts: [localhost]}

However the partition sizes are:
0
4365522
0
0

Essentially one partition has all the lines.
When reading using spark-sql, all is ok.

I'm using spark 1.6.1 and parquet-avro 1.7.0.

Thanks!
-- 
*JU Han*

Software Engineer @ Teads.tv

+33 061960


Spark standalone workers, executors and JVMs

2016-05-02 Thread captainfranz
I am still a little bit confused about workers, executors and JVMs in
standalone mode.
Are worker processes and executors independent JVMs or do executors run
within the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
worker (--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-workers-executors-and-JVMs-tp26860.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
JAVA does not easily parallelize, JAVA is verbose, uses different classes
for serializing, and on top of that you are using RDD's instead of
dataframes.

Should a senior project not have an implied understanding that it should be
technically superior?

Why not use SCALA?

Regards,
Gourav

On Mon, May 2, 2016 at 3:37 PM, Jinan Alhajjaj 
wrote:

>
> Because I am doing this project for my senior project by using Java.
> I try s3a URI as this:
>
> s3a://accessId:secret@bucket/path
>
> It shows me an error:
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
>
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(
> S3AFileSystem.java:287)
>
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>
> at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(
> FileInputFormat.java:256)
>
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(
> FileInputFormat.java:228)
>
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(
> FileInputFormat.java:313)
>
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>
> at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
>
> at org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:526)
> --
> Date: Thu, 28 Apr 2016 11:19:08 +0100
> Subject: Re: Reading from Amazon S3
> From: gourav.sengu...@gmail.com
> To: ste...@hortonworks.com
> CC: yuzhih...@gmail.com; j.r.alhaj...@hotmail.com; user@spark.apache.org
>
>
> Why would you use JAVA (create a problem and then try to solve it)? Have
> you tried using Scala or Python or even R?
>
> Regards,
> Gourav
>
> On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran 
> wrote:
>
>
> On 26 Apr 2016, at 18:49, Ted Yu  wrote:
>
> Looking at the cause of the error, it seems hadoop-aws-xx.jar
> (corresponding to the version of hadoop you use) was missing in classpath.
>
>
> yes, that s3n was moved from hadoop-common to the new hadoop-aws, and
> without realising it broke a lot of things.
>
> you'll need hadoop-aws and jets3t on the classpath
>
> If you are using Hadoop 2.7, I'd recommend s3a instead, which means
> hadoop-aws and the exact same amazon-sdk that comes bundled with the hadoop
> binaries your version of spark is built with (it's a bit brittle API-wise)
>
>
> FYI
>
> On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj 
> wrote:
>
> Hi All,
> I am trying to read a file stored in Amazon S3.
> I wrote this code:

RE: Reading from Amazon S3

2016-05-02 Thread Jinan Alhajjaj

Because I am doing this project for my senior project by using Java.
I try s3a URI as this: s3a://accessId:secret@bucket/path
It shows me an error: Exception in thread "main" java.lang.NoSuchMethodError: 
com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at 
org.apache.spark.api.java.JavaPairRDD.reduceByKey(JavaPairRDD.scala:526)

Date: Thu, 28 Apr 2016 11:19:08 +0100
Subject: Re: Reading from Amazon S3
From: gourav.sengu...@gmail.com
To: ste...@hortonworks.com
CC: yuzhih...@gmail.com; j.r.alhaj...@hotmail.com; user@spark.apache.org

Why would you use JAVA (create a problem and then try to solve it)? Have you 
tried using Scala or Python or even R?
Regards,
Gourav

On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran  wrote:

On 26 Apr 2016, at 18:49, Ted Yu  wrote:

Looking at the cause of the error, it seems hadoop-aws-xx.jar (corresponding to 
the version of hadoop you use) was missing in classpath.

yes, that s3n was moved from hadoop-common to the new hadoop-aws, and without 
realising it broke a lot of things.

you'll need hadoop-aws and jets3t on the classpath

If you are using Hadoop 2.7, I'd recommend s3a instead, which means hadoop-aws 
and the exact same amazon-sdk that comes bundled with the hadoop binaries your 
version of spark is built with (it's a bit brittle API-wise)

FYI

On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj  wrote:

Hi All,
I am trying to read a file stored in Amazon S3.
I wrote this code:

import java.util.List;
import java.util.Scanner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
Please consider decreasing block size. 
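
For example, one way to do this from PySpark when writing the data out (a
rough sketch only; "parquet.block.size" is the standard Parquet row-group
size property, while the 32 MB value and the paths are placeholders):

# set a smaller Parquet row-group ("block") size before writing
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 32 * 1024 * 1024)

df = sqlContext.read.parquet("hdfs:///tmp/t1")      # placeholder input path
df.write.parquet("hdfs:///tmp/t1_smaller_blocks")   # placeholder output path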

Thanks

> On May 1, 2016, at 9:19 PM, Buntu Dev  wrote:
> 
> I got a 10g limitation on the executors and operating on parquet dataset with 
> block size 70M with 200 blocks. I keep hitting the memory limits when doing a 
> 'select * from t1 order by c1 limit 100' (ie, 1M). It works if I limit to 
> say 100k. What are the options to save a large dataset without running into 
> memory issues?
> 
> Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



REST API submission and Application ID

2016-05-02 Thread thibault.duperron
Hi,

I tried to monitor Spark applications through the Spark APIs. I can submit a new 
application/driver with the REST API (POST 
http://spark-cluster-ip:6066/v1/submissions/create ...). The API returns the 
driver's id (submissionId). I can check the driver's status and kill it with 
the same API.
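
As an illustration of the status call described above, a small sketch (the
host, port and the driver id returned by the create call are placeholders):

import requests

r = requests.get(
    "http://spark-cluster-ip:6066/v1/submissions/status/driver-20160502123456-0001")
print(r.json().get("driverState"))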

However, I need to get the linked application's status. I've seen the REST API for 
monitoring applications, but how can I find the application's id from the driver's id? 
And is there a way to know which port needs to be requested (4040, 4041 
)

Thanks

Thibault


_

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.



Re: Can not import KafkaProducer in spark streaming job

2016-05-02 Thread fanooos
I could solve the issue, but the solution is very weird.

I ran the command cat old_script.py > new_script.py and then submitted the
job using the new script.

This is the second time I have faced such an issue with a Python script, and
I have no explanation for what happened.

I hope this trick helps someone else who faces a similar situation.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-import-KafkaProducer-in-spark-streaming-job-tp26857p26859.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



kafka direct streaming python API fromOffsets

2016-05-02 Thread Tigran Avanesov

Hi,

I'm trying to start consuming messages from a kafka topic (via direct 
stream) from a given offset.

The documentation of createDirectStream says:

:param fromOffsets: Per-topic/partition Kafka offsets defining the 
(inclusive) starting

point of the stream.

However it expects a dictionary of topics (not names...), as I tried to 
feed it something like { 'topic' : {0: 123, 1:234}}, and of course got 
an exception.

How should I build this fromOffsets parameter?

Documentation does not say anything about it.
(In general, I think it would be better if the function accepted topic 
names)


Thank you!

Regards,
Tigran


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL with large result size

2016-05-02 Thread ayan guha
How many executors are you running? Does your partition scheme ensure data
is distributed evenly? It is possible that your data is skewed and one of
the executors is failing. Maybe you can try reducing per-executor memory and
increasing the number of partitions.
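
A rough sketch of that suggestion in PySpark (the shuffle partition count and
output path are placeholders; the query is the one from the original post):

# more shuffle partitions, and write the result out instead of collecting it
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
result = sqlContext.sql("SELECT * FROM t1 ORDER BY c1 LIMIT 1000000")
result.write.parquet("hdfs:///tmp/t1_sorted")
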
On 2 May 2016 14:19, "Buntu Dev"  wrote:

> I got a 10g limitation on the executors and operating on parquet dataset
> with block size 70M with 200 blocks. I keep hitting the memory limits when
> doing a 'select * from t1 order by c1 limit 100' (ie, 1M). It works if
> I limit to say 100k. What are the options to save a large dataset without
> running into memory issues?
>
> Thanks!
>


Re: Spark on AWS

2016-05-02 Thread Gourav Sengupta
Hi,

I agree with Steve, just start using vanilla SPARK EMR.

You can try to see point #4 here for dynamic allocation of executors
https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin
.

Note that dynamic allocation of executors takes a bit of time for the jobs
to start running, therefore you can provide another suggestion to EMR
clusters while starting so that they allocate maximum possible processing
to executors as the EMR clusters start using maximizeResourceAllocation as
mentioned here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
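
For reference, a minimal sketch of turning dynamic allocation on from the
application configuration (the property names are standard Spark settings;
the application name is made up):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("recommendation-engine")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true"))
sc = SparkContext(conf=conf)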

In case you are trying to load enough data in the Spark master node for
graphing or exploratory analysis using Matlab, seaborn or bokeh, it's better
to increase the driver memory by recreating the Spark context.


Regards
Gourav Sengupta



On Mon, May 2, 2016 at 12:54 AM, Teng Qiu  wrote:

> Hi, here we made several optimizations for accessing s3 from spark:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>
> such as:
>
> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133
>
> you can deploy our spark package using our docker image, just simply:
>
> docker run -d --net=host \
>-e START_MASTER="true" \
>-e START_WORKER="true" \
>-e START_WEBAPP="true" \
>-e START_NOTEBOOK="true" \
>registry.opensource.zalan.do/bi/spark:1.6.2-6
>
>
> a Jupyter notebook will be running on port 
>
>
> have fun
>
> Best,
>
> Teng
>
> 2016-04-29 12:37 GMT+02:00 Steve Loughran :
> >
> > On 28 Apr 2016, at 22:59, Alexander Pivovarov 
> wrote:
> >
> > Spark works well with S3 (read and write). However it's recommended to
> set
> > spark.speculation true (it's expected that some tasks fail if you read
> large
> > S3 folder, so speculation should help)
> >
> >
> >
> > I must disagree.
> >
> > Speculative execution has >1 executor running the query, with whoever
> > finishes first winning.
> > however, "finishes first" is implemented in the output committer, by
> > renaming the attempt's output directory to the final output directory:
> > whoever renames first wins.
> > This relies on rename() being implemented in the filesystem client as an
> > atomic transaction.
> > Unfortunately, S3 doesn't do renames. Instead every file gets copied to
> one
> > of the new name, then the old file deleted; an operation that takes time
> > O(data * files)
> >
> > if you have more than one executor trying to commit the work
> simultaneously,
> > your output will be mess of both executions, without anything detecting
> and
> > reporting it.
> >
> > Where did you find this recommendation to set speculation=true?
> >
> > -Steve
> >
> > see also: https://issues.apache.org/jira/browse/SPARK-10063
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2016-05-02 Thread Kapil Raaj
Hi folks,

I am suddenly seeing:

Error:scalac: bad symbolic reference. A signature in Logging.class refers
to type Logger
in package org.slf4j which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
Logging.class.

How can I investigate and fix this? I am using IntelliJ IDEA.

-- 
-Kapil Rajak 


Re: Error in spark-xml

2016-05-02 Thread Mail.com
Can you try creating your own schema file and using it to read the XML?

I had a similar issue but got it resolved with a custom schema, specifying 
each attribute in it.
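
A hedged sketch of that approach with spark-xml (the "book" and "bkval"
element names come from the thread below; the array layout of the schema and
the file path are assumptions):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# assumed schema: each "book" row carries a repeated "bkval" string element
schema = StructType([StructField("bkval", ArrayType(StringType()), True)])

df = (sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(schema)
      .load("books.xml"))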

Pradeep


> On May 1, 2016, at 9:45 AM, Hyukjin Kwon  wrote:
> 
> To be more clear,
> 
> If you set the rowTag as "book", then it produces an exception, which is 
> an issue opened here: https://github.com/databricks/spark-xml/issues/92
> 
> Currently it does not support parsing a single element with only a value as 
> a row.
> 
> 
> If you set the rowTag as "bkval", then it should work. I tested the case 
> below to double check.
> 
> If it does not work as below, please open an issue with some information so 
> that I can reproduce.
> 
> 
> I tested the case above with the data below
> 
>   
> bk_113
> bk_114
>   
>   
> bk_114
> bk_116
>   
>   
> bk_115
> bk_116
>   
> 
> 
> 
> I tested this with the codes below
> 
> val path = "path-to-file"
> sqlContext.read
>   .format("xml")
>   .option("rowTag", "bkval")
>   .load(path)
>   .show()
> 
> Thanks!
> 
> 
> 2016-05-01 15:11 GMT+09:00 Hyukjin Kwon :
>> Hi Sourav,
>> 
>> I think it is an issue. XML will assume the element by the rowTag as object.
>> 
>>  Could you please open an issue in 
>> https://github.com/databricks/spark-xml/issues please?
>> 
>> Thanks!
>> 
>> 
>> 2016-05-01 5:08 GMT+09:00 Sourav Mazumder :
>>> Hi,
>>> 
>>> Looks like there is a problem in spark-xml if the xml has multiple 
>>> attributes with no child element.
>>> 
>>> For example say the xml has a nested object as below 
>>> 
>>> bk_113
>>> bk_114
>>>  
>>> 
>>> Now if I create a dataframe starting with rowtag bkval and then I do a 
>>> select on that data frame it gives following error.
>>> 
>>> 
>>> scala.MatchError: ENDDOCUMENT (of class 
>>> com.sun.xml.internal.stream.events.EndDocumentEvent) at 
>>> com.databricks.spark.xml.parsers.StaxXmlParser$.checkEndElement(StaxXmlParser.scala:94)
>>>  at  
>>> com.databricks.spark.xml.parsers.StaxXmlParser$.com$databricks$spark$xml$parsers$StaxXmlParser$$convertObject(StaxXmlParser.scala:295)
>>>  at 
>>> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:58)
>>>  at 
>>> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:46)
>>>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at 
>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at 
>>> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at 
>>> scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at 
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at 
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
>>> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
>>> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at 
>>> scala.collection.AbstractIterator.to(Iterator.scala:1157) at 
>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
>>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at 
>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
>>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>>  at 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>>  at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>>  at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at 
>>> org.apache.spark.scheduler.Task.run(Task.scala:88) at 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>  at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>  at java.lang.Thread.run(Thread.java:745)
>>> 
>>> However if there is only one row like below, it works fine.
>>> 
>>> 
>>> bk_113
>>> 
>>> 
>>> Any workaround ?
>>> 
>>> Regards,
>>> Sourav
> 


Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): repo1.maven.org: unk

2016-05-02 Thread sunday2000
Hi, after stopping the Zinc server, I got this error message:
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 8.649 s
[INFO] Finished at: 2016-05-02T14:19:21+08:00
[INFO] Final Memory: 38M/213M
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more
Caused by: javac returned nonzero exit code
at sbt.compiler.JavaCompiler$JavaTool0.compile(JavaCompiler.scala:77)
at sbt.compiler.JavaTool$class.apply(JavaCompiler.scala:35)
at sbt.compiler.JavaCompiler$JavaTool0.apply(JavaCompiler.scala:63)
at sbt.compiler.JavaCompiler$class.compile(JavaCompiler.scala:21)
at sbt.compiler.JavaCompiler$JavaTool0.compile(JavaCompiler.scala:63)
at 
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileJava$1$1.apply$mcV$sp(AggressiveCompile.scala:127)
at 
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileJava$1$1.apply(AggressiveCompile.scala:127)
at 
sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileJava$1$1.apply(AggressiveCompile.scala:127)
at 
sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:166)
at 
sbt.compiler.AggressiveCompile$$anonfun$3.compileJava$1(AggressiveCompile.scala:126)
at 
sbt.compiler.AggressiveCompile$$anonfun$3.apply(AggressiveCompile.scala:143)
at 
sbt.compiler.AggressiveCompile$$anonfun$3.apply(AggressiveCompile.scala:87)
at 
sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:39)
at 
sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:37)
at sbt.inc.IncrementalCommon.cycle(Incremental.scala:99)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:38)
at sbt.inc.Incremental$$anonfun$1.apply(Incremental.scala:37)
at sbt.inc.Incremental$.manageClassfiles(Incremental.scala:65)
at sbt.inc.Incremental$.compile(Incremental.scala:37)
at sbt.inc.IncrementalCompile$.apply(Compile.scala:27)
at