Re: Consume WebService in Spark

2016-05-02 Thread Jörn Franke
It is not different in Spark compared to any other program. However, a web service and JSON are probably not very suitable for large data volumes. > On 03 May 2016, at 04:45, KhajaAsmath Mohammed > wrote: > > Hi, > > I am working on a project to pull data from sprinklr

Re: Performance benchmarking of Spark Vs other languages

2016-05-02 Thread Jörn Franke
Hello, Spark is a general framework for distributed in-memory processing. You can always write a highly specialized piece of code that is faster than Spark, but then it can do only one thing, and if you need something else you will have to rewrite everything from scratch. This is why Spark is

Performance benchmarking of Spark Vs other languages

2016-05-02 Thread Abhijith Chandraprabhu
Hello, I am trying to find some performance figures for Spark vs. various other languages for an ALS-based recommender system. I am using the 20 million ratings MovieLens dataset. The test environment involves one big 30-core machine with 132 GB of memory. I am using the Scala version of the script provided

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
print() isn't really the best way to benchmark things, since it calls take(10) under the covers, but 380 records / second for a single receiver doesn't sound right in any case. Am I understanding correctly that you're trying to process a large number of already-existing kafka messages, not keep

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi, it's 4:40 in the morning here, therefore I might not be getting things right. But there is a very high chance of getting spurious results in case you have created that variable more than once in IPython or the pyspark shell, cached it, and are reusing it. Please close the sessions and create the

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread sunday2000
I use this command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.0 package -DskipTests -X and get this failure message: [INFO] [INFO] BUILD FAILURE [INFO]

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread Ted Yu
I rebuilt 1.6.1 locally: [INFO] Spark Project External Kafka ... SUCCESS [ 30.868 s] [INFO] Spark Project Examples . SUCCESS [02:29 min] [INFO] Spark Project External Kafka Assembly .. SUCCESS [ 9.644 s] [INFO]

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread sunday2000
Hi, This is not a continuation of a previous query, and I am now building connected to the internet without a proxy, as before. After disabling Zinc, I get this error message: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project

Consume WebService in Spark

2016-05-02 Thread KhajaAsmath Mohammed
Hi, I am working on a project to pull data from Sprinklr every 15 minutes and process it in Spark. After processing it, I need to save it back to an S3 bucket. Is there a way that I can talk to the web service in Spark directly and parse the JSON response? Thanks, Asmath
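
A minimal pyspark sketch of that flow, assuming the endpoint URL, auth token, and bucket name are placeholders (Sprinklr's real API is not shown here) and that the response is small enough to fetch on the driver:

    import requests
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="SprinklrPull")
    sqlContext = SQLContext(sc)

    # Fetch the JSON payload on the driver (placeholder URL and token).
    resp = requests.get("https://api.example.com/export",
                        headers={"Authorization": "Bearer TOKEN"})

    # Parse the response into a DataFrame and append it to an S3 bucket.
    df = sqlContext.read.json(sc.parallelize([resp.text]))
    df.write.mode("append").json("s3a://my-bucket/sprinklr/")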

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread Ted Yu
Looks like this was a continuation of your previous query. If that is the case, please use the original thread so that people have more context. Have you tried disabling the Zinc server? What version of Java / Maven are you using? Are you behind a proxy? Finally, the 1.6.1 artifacts are

spark 1.6.1 build failure of : scala-maven-plugin

2016-05-02 Thread sunday2000
[INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 14.765 s [INFO] Finished at: 2016-05-03T10:08:46+08:00 [INFO] Final Memory: 35M/191M [INFO]

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Luciano Resende
You might have a settings.xml that is forcing your internal Maven repository to be the mirror of external repositories and thus not finding the dependency. On Mon, May 2, 2016 at 6:11 PM, Hien Luu wrote: > Not I am not. I am considering downloading it manually and place it

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Hien Luu
No, I am not. I am considering downloading it manually and placing it in my local repository. On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Oracle jdbc is not part of Maven repository, are you keeping a downloaded > file in your local repo? > >
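
For reference, one common way to do that manual install (the jar path below is a placeholder):

    # install the Oracle driver downloaded from OTN into the local Maven repository
    mvn install:install-file -Dfile=/path/to/ojdbc6.jar \
        -DgroupId=com.oracle -DartifactId=ojdbc6 \
        -Dversion=11.2.0.1.0 -Dpackaging=jar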

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Ted Yu
From the output of dependency:tree of master branch: [INFO] [INFO] Building Spark Project Docker Integration Tests 2.0.0-SNAPSHOT [INFO] [WARNING] The

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread रविशंकर नायर
Oracle JDBC is not part of the Maven repository; are you keeping a downloaded file in your local repo? Best, RS On May 2, 2016 8:51 PM, "Hien Luu" wrote: > Hi all, > > I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0. > It kept getting "Operation timed

Error from reading S3 in Scala

2016-05-02 Thread Zhang, Jingyu
Hi All, I am using Eclipse with Maven for developing Spark applications. I got an error reading from S3 in Scala, but it works fine in Java when I run them in the same project in Eclipse. The Scala/Java code and the error follow. Scala: val uri = URI.create("s3a://" + key + ":" + seckey

Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-02 Thread Hien Luu
Hi all, I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0. It kept getting "Operation timed out" while building the Spark Project Docker Integration Tests module (see the error below). Has anyone run into this problem before? If so, how did you work around it? [INFO]

RE: Spark standalone workers, executors and JVMs

2016-05-02 Thread Mohammed Guller
The workers and executors run as separate JVM processes in standalone mode. The use of multiple workers on a single machine depends on how you will be using the cluster. If you run multiple Spark applications simultaneously, each application gets its own executors. So, for example, if
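
A hedged sketch of one way to split a 192GB node into several smaller worker JVMs in standalone mode (the values are illustrative, not a recommendation):

    # conf/spark-env.sh on each worker node
    SPARK_WORKER_INSTANCES=4   # four worker JVMs per machine instead of one huge one
    SPARK_WORKER_MEMORY=45g    # memory each worker can grant to executors
    SPARK_WORKER_CORES=8       # cores each worker can grant to executors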

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hello again. I searched for "backport kafka" in the list archives but couldn't find anything but a post from Spark 0.7.2. I was going to use accumulators to make a counter, but then saw the Receiver Statistics on the Streaming tab. Then I removed all other "functionality" except:

Re: Redirect from yarn to spark history server

2016-05-02 Thread Marcelo Vanzin
See http://spark.apache.org/docs/latest/running-on-yarn.html, especially the parts that talk about spark.yarn.historyServer.address. On Mon, May 2, 2016 at 2:14 PM, satish saley wrote: > > > Hello, > > I am running pyspark job using yarn-cluster mode. I can see spark job
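
For illustration, the relevant properties usually look something like this (host and paths are placeholders):

    # spark-defaults.conf on the submitting host
    spark.eventLog.enabled            true
    spark.eventLog.dir                hdfs:///user/spark/applicationHistory
    spark.yarn.historyServer.address  historyserver-host:18080

    # the history server itself reads the same directory
    spark.history.fs.logDirectory     hdfs:///user/spark/applicationHistory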

Redirect from yarn to spark history server

2016-05-02 Thread satish saley
Hello, I am running a pyspark job using yarn-cluster mode. I can see the Spark job in YARN but I am not able to go from any "log history" link in YARN to the Spark history server. How would I keep track of a YARN log and its corresponding log in the Spark history server? Is there any setting in yarn/spark that lets

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong, Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary. From the above set of queries (the ones mentioned in the comments above), both the inner and outer joins are producing the same counts. They are basically pulling out selected

Re: Number of executors change during job running

2016-05-02 Thread Vikash Pareek
Hi Bill, You can try DirectStream and increase the number of partitions on the Kafka topic; then the input DStream will have the same partitions as the Kafka topic without re-partitioning. Can you please share your event timeline chart from the Spark UI? You need to tune your configuration as per your computation. The Spark UI

QueryExecution to String breaks with OOM

2016-05-02 Thread Brandon White
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2367) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130) at

RE: Weird results with Spark SQL Outer joins

2016-05-02 Thread Yong Zhang
We are still not sure what the problem is if you cannot show us some example data. For dps with 42632 rows and swig with 42034 rows, if dps full outer joined with swig on 3 columns, with additional filters, gets the same result set row count as dps left outer joined with swig on 3 columns,

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, I wish that were the case, but I have done a select count on each of the two tables individually and they return different numbers of rows: dps.registerTempTable("dps_pin_promo_lt") swig.registerTempTable("swig_pin_promo_lt") dps.count() RESULT: 42632 swig.count() RESULT: 42034

how to orderBy previous groupBy.count.orderBy in pyspark

2016-05-02 Thread webe3vt
I have the following simple example that I can't get to work correctly. In [1]: from pyspark.sql import SQLContext, Row from pyspark.sql.types import StructType, StructField, IntegerType, StringType from pyspark.sql.functions import asc, desc, sum, count sqlContext = SQLContext(sc) error_schema
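
Assuming the goal is to order a grouped count, a minimal sketch (error_df and error_code are placeholder names, not the poster's actual schema):

    from pyspark.sql.functions import desc

    counts = error_df.groupBy("error_code").count()

    # groupBy().count() names its result column "count", so it can be sorted directly
    counts.orderBy(desc("count")).show()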

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
This shows that both the tables have matching records and no mismatches. Therefore obviously you have the same results irrespective of whether you use right or left join. I think that there is no problem here, unless I am missing something. Regards, Gourav On Mon, May 2, 2016 at 7:48 PM, kpeng1

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread kpeng1
Also, the results of the inner query produced the same results: sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date

Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Adding back user@spark. Since the top of the stack trace is in Datastax class(es), I suggest posting on their mailing list. On Mon, May 2, 2016 at 11:29 AM, Piyush Verma wrote: > Hmm weird. They show up on the Web interface. > > Wait, got it. It's wrapped up inside the < raw

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi Kevin, Thanks. Please post the result of the same query with INNER JOIN and then it will give us a bit of insight. Regards, Gourav On Mon, May 2, 2016 at 7:10 PM, Kevin Peng wrote: > Gourav, > > Apologies. I edited my post with this information: > Spark version: 1.6 >

Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Maybe you were trying to embed pictures for the error and your code - but they didn't go through. On Mon, May 2, 2016 at 10:32 AM, meson10 wrote: > Hi, > > I am trying to save a RDD to Cassandra but I am running into the following > error: > > > > The Python code looks

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
Jörn, what aspects are you speaking about? My response was absolutely pertinent to Jinan because he would not even face the problem if he used Scala. So it was along the lines of teaching a person to fish rather than giving him a fish. And by the way, your blinkered and biased response missed

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi Cody, I'm going to use an accumulator right now to get an idea of the throughput. Thanks for mentioning the back ported module. Also it looks like I missed this section: https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch from the
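
A minimal sketch of that accumulator-based throughput counter, assuming sc, ssc, and a Kafka DStream named stream already exist:

    record_count = sc.accumulator(0)

    def count_batch(rdd):
        n = rdd.count()            # action runs on the executors
        record_count.add(n)        # totals are accumulated on the driver
        print("batch: %d records, total: %d" % (n, record_count.value))

    stream.foreachRDD(count_batch)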

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi David, My current concern is that I'm using a spark hbase bulk put driver written for Spark 1.2 on the version of CDH my spark / yarn cluster is running on. Even if I were to run on another Spark cluster, I'm concerned that I might have issues making the put requests into hbase. However I

Re: Reading from Amazon S3

2016-05-02 Thread Jörn Franke
You are oversimplifying here and some of your statements are not correct. There are also other aspects to consider. Finally, it would be better to support him with the problem, because Spark supports Java. Java and Scala run on the same underlying JVM. > On 02 May 2016, at 17:42, Gourav

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, Apologies. I edited my post with this information: Spark version: 1.6 Result from spark shell OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 Thanks, KP On Mon,

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just deserialize)? Are you limited to using spark 1.2, or is upgrading possible? The kafka direct stream is available starting with 1.3. If you're stuck on 1.2, I believe there have been some attempts to backport it, search the

Re: SparkSQL with large result size

2016-05-02 Thread Gourav Sengupta
Hi, I have worked on 300GB of data by querying it from CSV (using SPARK CSV), writing it to Parquet format, then querying the Parquet data, partitioning it, and writing out individual CSV files, without any issues, on a single-node SPARK cluster installation. Are you trying to

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi, As always, can you please write down details regarding your SPARK cluster - the version, OS, IDE used, etc? Regards, Gourav Sengupta On Mon, May 2, 2016 at 5:58 PM, kpeng1 wrote: > Hi All, > > I am running into a weird result with Spark SQL Outer joins. The results >

Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
I've written an application to get content from a kafka topic with 1.7 billion entries, get the protobuf serialized entries, and insert into hbase. Currently the environment that I'm running in is Spark 1.2. With 8 executors and 2 cores, and 2 jobs, I'm only getting between 0-2500 writes /

java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread meson10
Hi, I am trying to save an RDD to Cassandra but I am running into the following error: The Python code looks like this: I am using DSE 4.8.6 which runs Spark 1.4.2 I ran through a bunch of existing posts on this mailing list and have already performed the following routines: * Ensure that

Weird results with Spark SQL Outer joins

2016-05-02 Thread kpeng1
Hi All, I am running into a weird result with Spark SQL Outer joins. The results for all of them seem to be the same, which does not make sense due to the data. Here are the queries that I am running with the results: sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS
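
One general point worth illustrating, without claiming it is the cause here: filters placed in the WHERE clause on the nullable side of an outer join discard exactly the unmatched rows the outer join adds, so the counts end up matching the inner join. A hedged DataFrame sketch that filters each side before joining (table and column names are taken from the query above; the date range is a placeholder):

    s = sqlContext.table("swig_pin_promo_lt") \
        .filter("date >= '2016-01-01' AND date <= '2016-01-31'")
    d = sqlContext.table("dps_pin_promo_lt") \
        .filter("date >= '2016-01-01' AND date <= '2016-01-31'")

    joined = s.join(d, ["date", "account", "ad"], "outer")
    joined.count()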

Re: kafka direct streaming python API fromOffsets

2016-05-02 Thread Cody Koeninger
If you're confused about the type of an argument, you're probably better off looking at documentation that includes static types: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$ createDirectStream's fromOffsets parameter takes a map from
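
A minimal pyspark sketch of passing fromOffsets, assuming ssc already exists and that the topic, partition, offset, and broker address are placeholders:

    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

    # start partition 0 of "mytopic" at offset 12345
    fromOffsets = {TopicAndPartition("mytopic", 0): 12345}

    stream = KafkaUtils.createDirectStream(
        ssc,
        ["mytopic"],
        {"metadata.broker.list": "broker1:9092"},
        fromOffsets=fromOffsets)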

Re: SparkSQL with large result size

2016-05-02 Thread Buntu Dev
Thanks Ted, I thought the avg. block size was already low and less than the usual 128mb. If I need to reduce it further via parquet.block.size, it would mean an increase in the number of blocks and that should increase the number of tasks/executors. Is that the correct way to interpret this? On

Fwd: Spark standalone workers, executors and JVMs

2016-05-02 Thread Simone Franzini
I am still a little bit confused about workers, executors and JVMs in standalone mode. Are worker processes and executors independent JVMs or do executors run within the worker JVM? I have some memory-rich nodes (192GB) and I would like to avoid deploying massive JVMs due to well known performance

zero-length input partitions from parquet

2016-05-02 Thread Han JU
Hi, I just found out that we can have lots of empty input partitions when reading from parquet files. Sample code as following: val hconf = sc.hadoopConfiguration val job = new Job(hconf) FileInputFormat.setInputPaths(job, new Path("path_to_data"))

Spark standalone workers, executors and JVMs

2016-05-02 Thread captainfranz
I am still a little bit confused about workers, executors and JVMs in standalone mode. Are worker processes and executors independent JVMs or do executors run within the worker JVM? I have some memory-rich nodes (192GB) and I would like to avoid deploying massive JVMs due to well known performance

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
JAVA does not easily parallelize, JAVA is verbose, uses different classes for serializing, and on top of that you are using RDDs instead of DataFrames. Should a senior project not have an implied understanding that it should be technically superior? Why not use SCALA? Regards, Gourav On Mon,

RE: Reading from Amazon S3

2016-05-02 Thread Jinan Alhajjaj
Because I am doing this for my senior project using Java. I tried the s3a URI like this: s3a://accessId:secret@bucket/path It shows me an error: Exception in thread "main" java.lang.NoSuchMethodError:
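
As an aside, the credentials do not have to be embedded in the URI; a hedged sketch of supplying them through the Hadoop configuration instead (shown in pyspark, the Java setup is analogous; key values are placeholders). Note that a NoSuchMethodError usually points at mismatched hadoop-aws / AWS SDK jar versions on the classpath rather than at the URI itself:

    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", "ACCESS_ID")   # placeholder
    hconf.set("fs.s3a.secret.key", "SECRET_KEY")  # placeholder

    rdd = sc.textFile("s3a://bucket/path")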

Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
Please consider decreasing block size. Thanks > On May 1, 2016, at 9:19 PM, Buntu Dev wrote: > > I got a 10g limitation on the executors and operating on parquet dataset with > block size 70M with 200 blocks. I keep hitting the memory limits when doing a > 'select * from
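
For illustration, the Parquet row-group size is controlled by the parquet.block.size Hadoop setting (in bytes) when the files are written; a hedged pyspark sketch with an arbitrary 32 MB value, assuming the dataset is rewritten with df as a placeholder DataFrame:

    # shrink the Parquet row groups when rewriting the dataset
    sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 32 * 1024 * 1024)
    df.write.parquet("path/to/output")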

REST API submission and Application ID

2016-05-02 Thread thibault.duperron
Hi, I am trying to monitor Spark applications through the Spark APIs. I can submit a new application/driver with the REST API (POST http://spark-cluster-ip:6066/v1/submissions/create ...). The API returns the driver's id (submissionId). I can check the driver's status and kill it with the same API.
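
For illustration, a create request against that endpoint looks roughly like this (this is the standalone cluster-mode REST API; jar path, class name, versions, and the submission id are placeholders):

    curl -X POST http://spark-cluster-ip:6066/v1/submissions/create \
      --header "Content-Type: application/json" \
      --data '{
        "action": "CreateSubmissionRequest",
        "appResource": "hdfs:///apps/my-app.jar",
        "mainClass": "com.example.Main",
        "appArgs": ["arg1"],
        "clientSparkVersion": "1.6.1",
        "sparkProperties": {
          "spark.app.name": "my-app",
          "spark.master": "spark://spark-cluster-ip:7077",
          "spark.submit.deployMode": "cluster"
        },
        "environmentVariables": {"SPARK_ENV_LOADED": "1"}
      }'

    # poll the status with the returned submissionId
    curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20160502-0001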

Re: Can not import KafkaProducer in spark streaming job

2016-05-02 Thread fanooos
I could solve the issue but the solution is very weird. I ran this command: cat old_script.py > new_script.py, then I submitted the job using the new script. This is the second time I have faced such an issue with a Python script and I have no explanation for what happened. I hope this trick helps someone

kafka direct streaming python API fromOffsets

2016-05-02 Thread Tigran Avanesov
Hi, I'm trying to start consuming messages from a kafka topic (via direct stream) from a given offset. The documentation of createDirectStream says: :param fromOffsets: Per-topic/partition Kafka offsets defining the (inclusive) starting point of the stream. However it expects a dictionary

Re: SparkSQL with large result size

2016-05-02 Thread ayan guha
How many executors are you running? Does your partitioning scheme ensure data is distributed evenly? It is possible that your data is skewed and one of the executors is failing. Maybe you can try reducing per-executor memory and increasing partitions. On 2 May 2016 14:19, "Buntu Dev"

Re: Spark on AWS

2016-05-02 Thread Gourav Sengupta
Hi, I agree with Steve, just start using vanilla SPARK EMR. You can try to see point #4 here for dynamic allocation of executors https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin . Note that dynamic allocation of
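
As a pointer, dynamic allocation is driven by a handful of properties (the external shuffle service is a prerequisite); the values below are only a sketch:

    spark.dynamicAllocation.enabled        true
    spark.shuffle.service.enabled          true
    spark.dynamicAllocation.minExecutors   2
    spark.dynamicAllocation.maxExecutors   20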

A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2016-05-02 Thread Kapil Raaj
Hi folks, I am suddenly seeing: Error:scalac: bad symbolic reference. A signature in Logging.class refers to type Logger in package org.slf4j which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version
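
If the cause is simply a missing slf4j-api on the compile classpath, a hedged guess at the Maven fix (the version is illustrative; match it to the slf4j version your Spark build pulls in):

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.16</version>
    </dependency>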

Re: Error in spark-xml

2016-05-02 Thread Mail.com
Can you try creating your own schema file and using it to read the XML? I had a similar issue but resolved it with a custom schema, specifying each attribute in it. Pradeep > On May 1, 2016, at 9:45 AM, Hyukjin Kwon wrote: > > To be more clear, > > If

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): repo1.maven.org: unk

2016-05-02 Thread sunday2000
Hi, after stopping the Zinc server, I got this error message: [INFO] Spark Project External Kafka Assembly .. SKIPPED [INFO] [INFO] BUILD FAILURE [INFO]