New Spark User Group in Florida

2016-03-06 Thread रविशंकर नायर
Hi Organizer, We have just started a new user group for Spark in Florida. Can you please add this entry to the Spark community page? Thanks. Florida Spark Meetup. Best regards, R Nair.

Re: how to implement ALS with csv file? getting error while calling Rating class

2016-03-06 Thread Nick Pentreath
As you've pointed out, Rating requires user and item ids in Int form. So you will need to map String user ids to integers. See this thread for example: https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJgQjQ9GhGqpg1=hvxpfrs+59elfj9f7knhp8nyqnh1ut_6...@mail.gmail.com%3E .
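A minimal sketch of the id-mapping idea from the linked thread (editor's illustration with plain Scala collections and made-up data; on an RDD you would use `.distinct().zipWithIndex()` the same way before constructing `Rating(user: Int, product: Int, rating: Double)`):

```scala
// Hypothetical raw ratings with String user ids (assumed sample data).
val raw = Seq(("alice", 101, 5.0), ("bob", 101, 3.0), ("alice", 102, 4.0))

// Build a lookup from String user id to a dense Int id.
val userIdMap: Map[String, Int] =
  raw.map(_._1).distinct.zipWithIndex.toMap

// Replace the String ids, yielding (userInt, item, rating) triples
// that can then be turned into Rating objects.
val mapped = raw.map { case (u, item, r) => (userIdMap(u), item, r) }
```

Keep the `userIdMap` around (or persist it) so predictions can be translated back to the original String ids.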

reading the parquet file in spark sql

2016-03-06 Thread Angel Angel
Hello Sir/Madam, I am running one Spark application with 3 slaves and one master. I am writing my data using the Parquet format, but when I try to read it back I get an error. Please help me resolve this problem. Code: val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Re: Spark reduce serialization question

2016-03-06 Thread Holden Karau
You might want to try treeAggregate. On Sunday, March 6, 2016, Takeshi Yamamuro wrote: > Hi, > > I'm not exactly sure what your code looks like, but ISTM this is correct > behaviour. If the size of the data that the driver fetches exceeds the limit, the driver > throws this
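A sketch of Holden's suggestion (editor's illustration; assumes an `RDD[Double]` and the standard Spark RDD API, so it needs a Spark runtime): treeAggregate merges partition results in log-depth stages on the executors, so the driver fetches one combined value instead of one result per partition.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical: sum an RDD[Double] without shipping every partition's
// result straight to the driver. treeAggregate combines partials on
// executors in a tree of `depth` levels first.
def treeSum(rdd: RDD[Double]): Double =
  rdd.treeAggregate(0.0)(
    seqOp  = (acc, x) => acc + x,  // fold within each partition
    combOp = (a, b)   => a + b,    // merge partial sums, tree-wise
    depth  = 2                     // 2 is treeAggregate's default depth
  )
```

With plain `reduce`/`aggregate`, the driver receives one result per partition, which is what can trip the result-size limit Takeshi mentions.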

Re: Spark reduce serialization question

2016-03-06 Thread Takeshi Yamamuro
Hi, I'm not exactly sure what your code looks like, but ISTM this is correct behaviour. If the size of the data that the driver fetches exceeds the limit, the driver throws this exception. (See

Understanding the Web_UI 4040

2016-03-06 Thread Angel Angel
Hello Sir/Madam, I am running a Spark SQL application on the cluster. In my cluster there are 3 slaves and one master. When I check the progress of my application in the web UI (hadoopm0:8080), I see that one of my slave nodes is always in *LOADING* mode. Can you tell me what that means? Also

YARN nodemanager always uses its own hostname

2016-03-06 Thread HIGUCHI Daisuke
Hello, I built a Spark on HDFS/YARN cluster in Docker containers. * Spark on YARN version - Spark 1.6.0 - Hadoop 2.6.0 (CDH 5.6.0) - Oracle Java 1.8.0_74 There are one HDFS/YARN master and one HDFS/YARN worker, each in its own container. The spark-yarn-master container has the below hostname and IP addr.

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-06 Thread Chris Miller
For anyone running into this same issue, it looks like Avro deserialization is just broken when used with SparkSQL and partitioned schemas. I created a bug report with details and a simplified example showing how to reproduce it: https://issues.apache.org/jira/browse/SPARK-13709 -- Chris Miller On

Re: Is Spark right for us?

2016-03-06 Thread Chris Miller
Gut instinct is no, Spark is overkill for your needs... you should be able to accomplish all of that with a relational database or a column-oriented database (depending on the types of queries you most frequently run and the performance requirements). -- Chris Miller On Mon, Mar 7, 2016 at 1:17

Re: Is Spark right for us?

2016-03-06 Thread Peyman Mohajerian
If your relational database has enough computing power, you don't have to change it. You can just run SQL queries on top of it, or even run Spark queries over it. There is no hard-and-fast rule about using big data tools. Usually people or organizations don't jump into big data for one specific use

Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale and data volume. 1. If you can store the data on a single server or a small cluster of DB servers (say, MySQL), then HDFS/Spark might be overkill. 2. If you can run the computation/process the data on a single

Spark Custom Partitioner not picked

2016-03-06 Thread Prabhu Joseph
Hi All, When I submit a Spark job on YARN with a custom Partitioner, it is not picked up by the executors; they still use the default HashPartitioner. I added logs into both HashPartitioner (org/apache/spark/Partitioner.scala) and the custom Partitioner. The completed executor logs show
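One thing worth checking (editor's note): a custom Partitioner is only used where it is passed explicitly, e.g. `rdd.partitionBy(new MyPartitioner(n))` or as the optional argument to `reduceByKey`; otherwise shuffle operations default to HashPartitioner. For reference, a plain-Scala stand-in for the `getPartition` contract a custom Partitioner overrides (this mimics HashPartitioner's hashing and is not Spark's actual trait):

```scala
// Plain-Scala sketch of a Partitioner's getPartition contract.
// A real implementation would extend org.apache.spark.Partitioner.
class SimpleCustomPartitioner(val numPartitions: Int) {
  // hashCode can be negative in Scala/Java, so take a non-negative modulo.
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }
  def getPartition(key: Any): Int = key match {
    case null => 0 // Spark sends null keys to partition 0
    case k    => nonNegativeMod(k.hashCode, numPartitions)
  }
}
```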

Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi, once again that is all about tooling. Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 7:52 PM, Mich Talebzadeh wrote: > Hi, > > > > What is the current size of your relational database? > > > > Are we talking about a row based RDBMS (Oracle, Sybase) or a

Re: Is Spark right for us?

2016-03-06 Thread Mich Talebzadeh
Hi, What is the current size of your relational database? Are we talking about a row-based RDBMS (Oracle, Sybase) or a columnar one (Teradata/Sybase IQ)? I assume that you will be using SQL wherever you migrate to. The SQL-on-Hadoop tools are divided between well-thought-out solutions

Re: MLLib + Streaming

2016-03-06 Thread Lan Jiang
Thanks, Guru. After reading the implementations of StreamingKMeans, StreamingLinearRegressionWithSGD and StreamingLogisticRegressionWithSGD, I reached the same conclusion. But unfortunately, this distinction between true online learning and offline learning is only implied in the documentation, and I

Re: Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Sorry, accidentally sent again. My apology. > On Mar 6, 2016, at 1:22 PM, Lan Jiang wrote: > > Hi, there > > I hope someone can clarify this for me. It seems that some of the MLlib > algorithms such as KMean, Linear Regression and Logistics Regression have a > Streaming

Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Hi there, I hope someone can clarify this for me. It seems that some of the MLlib algorithms, such as KMeans, Linear Regression and Logistic Regression, have a Streaming version, which can do online machine learning. But does that mean other MLlib algorithms cannot be used in Spark Streaming

Re: Spark + Kafka all messages being used in 1 batch

2016-03-06 Thread Shahbaz
- Do you happen to see how busy the nodes are in terms of CPU, and how much heap each executor is allocated?
- If there is enough capacity, you may want to increase the number of cores per executor to 2 and do the needed heap tweaking.
- How much time did it take to process 4M+

Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi, Spark is just tooling, and it's not even tooling. You can consider Spark a distributed operating system, like YARN. You should read books like Hadoop Application Architecture and Big Data (Nathan Marz), and other disciplines, before starting to consider how the solution is built. Most of the big

Re: Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar
Thanks Ted! I have created the https://issues.apache.org/jira/browse/SPARK-13707 JIRA ticket. As I commented, I would like to work on the fix once we decide what the correct behavior should be. -- Thanks Jatin Kumar | Rocket Scientist +91-7696741743 m On Sun, Mar 6, 2016 at 11:30 PM, Ted Yu

Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi, That depends on a lot of things, but as a starting point I would ask whether you are planning to store your data in JSON format? Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi < guillaume.bilod...@gmail.com> wrote: > Our problem space is survey analytics. Each

Re: how to implements a distributed system ?

2016-03-06 Thread Ted Yu
w.r.t. akka, please see the following: [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming. There are various ways to design a distributed system. Can you outline what your program does? Cheers On Sun, Mar 6, 2016 at 8:35 AM, Minglei Zhang wrote: > hello, experts

Is Spark right for us?

2016-03-06 Thread Laumegui Deaulobi
Our problem space is survey analytics. Each survey comprises a set of questions, with each question having a set of possible answers. Survey fill-out tasks are sent to users, who have until a certain date to complete it. Based on these survey fill-outs, reports need to be generated. Each

how to implements a distributed system ?

2016-03-06 Thread Minglei Zhang
hello, experts. Suppose I have a program on a local machine with a main function; in Java it is public static void main, in Scala it's an object extending App. Now I want to make it a distributed program. How can I do that? Maybe it is a direct and simple

how to implement ALS with csv file? getting error while calling Rating class

2016-03-06 Thread Shishir Anshuman
I am new to Apache Spark, and I want to implement the Alternating Least Squares algorithm. The data set is stored in a CSV file in the format: *Name,Value1,Value2*. When I read the CSV file, I get a *java.lang.NumberFormatException.forInputString* error because the Rating class needs the parameters
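A hedged sketch of one way around the exception (editor's illustration with plain Scala and made-up rows): assign each name a dense Int id, since `Rating` takes Int ids, and parse the numeric fields defensively, dropping rows that do not parse instead of throwing:

```scala
import scala.util.Try

// Hypothetical CSV rows in the Name,Value1,Value2 format described above.
val lines = Seq("alice,101,5.0", "bob,102,3.5", "carol,oops,1.0")

// Assign each distinct name a dense Int id (Rating needs Int user ids).
val nameToId: Map[String, Int] =
  lines.map(_.split(",")(0)).distinct.zipWithIndex.toMap

// Parse defensively: malformed rows become None and are dropped,
// instead of raising NumberFormatException mid-job.
val parsed = lines.flatMap { line =>
  Try {
    val Array(name, v1, v2) = line.split(",")
    (nameToId(name), v1.toInt, v2.toDouble)
  }.toOption
}
```

The same pattern works inside an RDD `flatMap` over the lines of the CSV file.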

Re: Spark + Kafka all messages being used in 1 batch

2016-03-06 Thread Vinti Maheshwari
I have 2 machines in my cluster with the below specifications: 128 GB RAM and 8 cores per machine. Regards, ~Vinti On Sun, Mar 6, 2016 at 7:54 AM, Vinti Maheshwari wrote: > Thanks Supreeth and Shahbaz. I will try adding > spark.streaming.kafka.maxRatePerPartition. > > Hi

Re: Spark + Kafka all messages being used in 1 batch

2016-03-06 Thread Vinti Maheshwari
Thanks Supreeth and Shahbaz. I will try adding spark.streaming.kafka.maxRatePerPartition. Hi Shahbaz, please see my comments inline:
- Which version of Spark are you using? ==> *1.5.2*
- How big is the Kafka cluster? ==> *2 brokers*
- What is the message size and type? ==> *String, 9,550
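For reference, the setting discussed in this thread goes on the SparkConf before the StreamingContext is created. A sketch (editor's illustration; the values are placeholders, not recommendations, and `spark.streaming.backpressure.enabled` is an additional knob available since Spark 1.5):

```scala
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("kafka-streaming")
  // Cap records pulled per Kafka partition per second for direct streams,
  // so a backlog is spread across batches instead of landing in one.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Optionally let Spark adapt the ingestion rate automatically.
  .set("spark.streaming.backpressure.enabled", "true")
```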

Re: Streaming UI tab misleading for window operations

2016-03-06 Thread Ted Yu
Have you taken a look at SPARK-12739 ? FYI On Sun, Mar 6, 2016 at 4:06 AM, Jatin Kumar < jku...@rocketfuelinc.com.invalid> wrote: > Hello all, > > Consider following two code blocks: > > val ssc = new StreamingContext(sparkConfig, Seconds(2)) > val stream = KafkaUtils.createDirectStream(...) >

Re: How can I pass a Data Frame from object to another class

2016-03-06 Thread Mich Talebzadeh
It would be interesting to know why these contexts are not available in the JVM outside of the class they were instantiated (created) in. For example, we could initialize an application with two threads as follows in the main method: val conf = new SparkConf(). setAppName("Harness4").

Re: Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-06 Thread Dhaval Modi
Hi Gourav, I am trying to overwrite an existing managed/internal table. I haven't registered the dataframe, so it's not a temporary table. BTW, I have added the code to the JIRA as a comment. Thanks, Dhaval On Mar 6, 2016 17:07, "Gourav Sengupta" wrote: > hi, > > is the table that

Streaming UI tab misleading for window operations

2016-03-06 Thread Jatin Kumar
Hello all, Consider following two code blocks: val ssc = new StreamingContext(sparkConfig, Seconds(2)) val stream = KafkaUtils.createDirectStream(...) a) stream.filter(filterFunc).count().foreachRDD(rdd => println(rdd.collect())) b) stream.filter(filterFunc).window(Seconds(60),

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Ted Yu
Thanks for the clarification, Gourav. > On Mar 6, 2016, at 3:54 AM, Gourav Sengupta wrote: > > Hi Ted, > > There was no idle time after I changed the path to start with s3a and then > ensured that the number of executors writing were large. The writes start and >

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Gourav Sengupta
Hi Ted, There was no idle time after I changed the path to start with s3a and then ensured that the number of executors writing were large. The writes start and complete in about 5 mins or less. Initially the write used to complete by around 30 mins and we could see that there were failure

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Ted Yu
Gourav: For the 3rd paragraph, did you mean the job seemed to be idle for about 5 minutes ? Cheers > On Mar 6, 2016, at 3:35 AM, Gourav Sengupta wrote: > > Hi, > > This is a solved problem, try using s3a instead and everything will be fine. > > Besides that you

Re: How can I pass a Data Frame from object to another class

2016-03-06 Thread Gourav Sengupta
Hi Ted/Holden, I had read a section in the book Learning Spark which advised passing just functions to Spark rather than entire objects (ref: page 30, "Passing Functions to Spark"). Is the above way of solving the problem not going against that? It will be exciting to see your kind explanation.

Re: Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-06 Thread Gourav Sengupta
hi, is the table that you are trying to overwrite an external table or a temporary table created in HiveContext? Regards, Gourav Sengupta On Sat, Mar 5, 2016 at 3:01 PM, Dhaval Modi wrote: > Hi Team, > > I am facing an issue while writing a dataframe back to a HIVE table. >

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Gourav Sengupta
Hi, This is a solved problem; try using s3a instead and everything will be fine. Besides that, you might want to use coalesce or partitionBy or repartition in order to see how many executors are being used to write (that speeds things up quite a bit). We had a write issue taking close to 50 min

how to flatten the dataframe

2016-03-06 Thread shubham@celebal
root
 |-- adultbasefare: long (nullable = true)
 |-- adultcommission: long (nullable = true)
 |-- adultservicetax: long (nullable = true)
 |-- adultsurcharge: long (nullable = true)
 |-- airline: string (nullable = true)
 |-- arrdate: string (nullable = true)
 |-- arrtime: string (nullable
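Assuming the truncated schema above also contains nested struct columns (the usual reason for wanting to flatten a DataFrame), one common sketch is to select every leaf field under an aliased name. Hypothetical code, not from the original thread; `df` is the DataFrame in question and needs a Spark runtime:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Flatten one level of struct columns: a struct field `fare.adult`
// becomes a top-level column `fare_adult`; flat columns pass through.
val flatCols = df.schema.fields.flatMap {
  case f if f.dataType.isInstanceOf[StructType] =>
    f.dataType.asInstanceOf[StructType].fields.map { c =>
      col(s"${f.name}.${c.name}").alias(s"${f.name}_${c.name}")
    }
  case f => Seq(col(f.name))
}
val flatDf = df.select(flatCols: _*)
```

For deeper nesting, apply the same step recursively until no StructType columns remain.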

Re: How can I pass a Data Frame from object to another class

2016-03-06 Thread Mich Talebzadeh
Thanks for this tip. The way I do it is to pass the SparkContext "sc" to the method firstquery.firstquerym by calling the following: val firstquery = new FirstQuery; firstquery.firstquerym(sc, rs). And I create the method as follows: class FirstQuery { def firstquerym(sc:

Re: Spark Streaming fileStream vs textFileStream

2016-03-06 Thread Yuval.Itzchakov
I don't think the documentation can be any more descriptive: /** * Create an input stream that monitors a Hadoop-compatible filesystem * for new files and reads them using the given key-value types and input format. * Files must be written to the monitored directory by "moving" them from

Continuous deployment to Spark Streaming application with sessionization

2016-03-06 Thread Yuval.Itzchakov
I've been recently thinking about continuous deployment to our spark streaming service. We have a streaming application which does sessionization via `mapWithState`, aggregating sessions in memory until they are ready to be deployed. Now, as I see things we have two use cases here: 1. Spark

Re: MLLib + Streaming

2016-03-06 Thread Chris Miller
Guru: This is a really great response. Thanks for taking the time to explain all of this. Helpful for me too. -- Chris Miller On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani wrote: > Hi Lan, > > Streaming KMeans, Linear Regression and Logistic Regression support online > machine

Re: Add the sql record having same field.

2016-03-06 Thread Jacek Laskowski
What about sum? Jacek 06.03.2016 7:28 AM "Angel Angel" wrote: > Hello, > I have one table with 2 fields in it: > 1) item_id and > 2) count > > > > I want to add up the count field per item (i.e. group by item_id). > > Example > Input > item_ID Count > 500 2 > 200 6
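Jacek's suggestion, sketched with plain Scala collections (editor's illustration; the extra `(500, 3)` row is sample data added to make the grouping visible). In Spark SQL this would be `SELECT item_id, SUM(count) FROM t GROUP BY item_id`, or `df.groupBy("item_id").sum("count")` on a DataFrame:

```scala
// (item_id, count) pairs; (500, 3) is an extra illustrative row.
val rows = Seq((500, 2), (200, 6), (500, 3))

// Group by item_id and sum the counts within each group.
val summed: Map[Int, Int] =
  rows.groupBy(_._1).map { case (id, grp) => (id, grp.map(_._2).sum) }
```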