Re: Aggregator support in DataFrame

2016-04-11 Thread Koert Kuipers
saw that, don't think it solves it. i basically want to add some children to the expression i guess, to indicate what i am operating on? not sure if that even makes sense On Mon, Apr 11, 2016 at 8:04 PM, Michael Armbrust wrote: > I'll note this interface has changed recently:

Re: how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
Okie. That makes sense. Any recommendations on how to manage changes to my spark streaming app and achieve fault tolerance at the same time? On Mon, Apr 11, 2016 at 8:16 PM, Shixiong(Ryan) Zhu wrote: > You cannot. Streaming doesn't support it because code changes

Re: how to deploy new code with checkpointing

2016-04-11 Thread Shixiong(Ryan) Zhu
You cannot. Streaming doesn't support it because code changes will break Java serialization. On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli wrote: > hello, > > i am writing a spark streaming application to read data from kafka. I am > using no receiver approach and enabled

Re: Aggregator support in DataFrame

2016-04-11 Thread Michael Armbrust
I'll note this interface has changed recently: https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d I'm not sure that solves your problem though... On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers wrote: > i like the Aggregator a lot

Aggregator support in DataFrame

2016-04-11 Thread Koert Kuipers
i like the Aggregator a lot (org.apache.spark.sql.expressions.Aggregator), but i find the way to use it somewhat confusing. I am supposed to simply call aggregator.toColumn, but that doesn't allow me to specify which fields it operates on in a DataFrame. i would basically like to do something
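For reference, a minimal sketch of this kind of usage, written against the Aggregator interface as it later stabilized in Spark 2.x (the 1.6-era trait being discussed here differs slightly, e.g. in how encoders are supplied); the Employee/Average names are illustrative only. The field the aggregator operates on is picked inside reduce, since the input type is the whole row:

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Employee(name: String, salary: Long)
    case class Average(var sum: Long, var count: Long)

    object MyAverage extends Aggregator[Employee, Average, Double] {
      def zero: Average = Average(0L, 0L)
      def reduce(buf: Average, e: Employee): Average = { buf.sum += e.salary; buf.count += 1; buf }
      def merge(b1: Average, b2: Average): Average = { b1.sum += b2.sum; b1.count += b2.count; b1 }
      def finish(r: Average): Double = r.sum.toDouble / r.count
      def bufferEncoder: Encoder[Average] = Encoders.product
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // ds: Dataset[Employee]; "which field it operates on" is decided inside reduce
    // val result = ds.select(MyAverage.toColumn.name("average_salary"))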

how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
hello, i am writing a spark streaming application to read data from kafka. I am using the no-receiver approach and enabled checkpointing to make sure I am not reading messages again in case of failure (exactly-once semantics). i have a quick question about how checkpointing needs to be configured to
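For reference, a minimal sketch of the checkpoint wiring being asked about (the directory path and batch interval are placeholders; the Kafka direct stream setup is omitted):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/app/checkpoints"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-reader")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // build the direct Kafka stream and the rest of the processing here
      ssc
    }

    // First run builds a fresh context; after a crash the same call restores the
    // context (including unfinished batches and offsets) from the checkpoint dir.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()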

History Server Refresh?

2016-04-11 Thread Miles Crawford
Hey there. I have my spark applications set up to write their event logs into S3 - this is super useful for ephemeral clusters, I can have persistent history even though my hosts go away. A history server is set up to view this s3 location, and that works fine too - at least on startup. The

Re: Infinite recursion in createDataFrame for avro types

2016-04-11 Thread Brad Cox
Regarding the attached report, I really need a clue... By replacing the avro-generated bean with a hand-coded one and steadily pruning it down to bare essentials, I've confirmed that the stack overflow is triggered by precisely this avro-generated line: public org.apache.avro.Schema

Re: Is storage resources counted during the scheduling

2016-04-11 Thread Jialin Liu
Thanks Ted, but that page seems to be about the scheduling policy; I have no idea what resources are considered in the scheduling. And for scheduling, I’m wondering: in the case of just one application, is there also a scheduling process? Otherwise, why do I see some launch delay in the tasks. (well,

RE: ML Random Forest Classifier

2016-04-11 Thread Ashic Mahtab
Thanks, James. That looks promising. > To add a bit more detail perhaps something like this might work: package org.apache.spark.ml import

Re: Is storage resources counted during the scheduling

2016-04-11 Thread Ted Yu
See https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application On Mon, Apr 11, 2016 at 3:15 PM, Jialin Liu wrote: > Hi Spark users/experts, > > I’m wondering how does the Spark scheduler work? > What kind of resources will be considered during the

Is storage resources counted during the scheduling

2016-04-11 Thread Jialin Liu
Hi Spark users/experts, I’m wondering how the Spark scheduler works. What kind of resources are considered during scheduling: does it include disk resources or I/O resources, e.g., the number of IO ports? Are network resources considered in that? My understanding is that only CPU

Re: Profiling a spark job

2016-04-11 Thread Alexander Krasheninnikov
If you are profiling in standalone mode, I recommend you to try with Java Mission Control. You just need to start app with these params: -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=$YOUR_PORT
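A sketch of one way to wire those flags into a Spark app through the extraJavaOptions settings (the port is a placeholder; in client mode the driver options are normally supplied via spark-submit --conf or spark-defaults.conf so they take effect before the driver JVM starts):

    import org.apache.spark.SparkConf

    val jfrOpts = Seq(
      "-XX:+UnlockCommercialFeatures",
      "-XX:+FlightRecorder",
      "-Dcom.sun.management.jmxremote=true",
      "-Dcom.sun.management.jmxremote.port=9999"   // placeholder port
    ).mkString(" ")

    val conf = new SparkConf()
      .setAppName("profiled-app")
      .set("spark.executor.extraJavaOptions", jfrOpts)
      .set("spark.driver.extraJavaOptions", jfrOpts)   // see note above about client mode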

Grouping in Spark Streaming / batch size = time window?

2016-04-11 Thread Daniela S
Hi, I am a newbie in Spark Streaming and have some questions. 1) Is it possible to group a stream in Spark Streaming like in Storm (field grouping)? 2) Could the batch size be used instead of a time window? Thank you in advance. Regards, Daniela
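A minimal sketch of what the two questions map to in the DStream API (source, key and durations are placeholders): keying the stream gives a rough analogue of Storm's field grouping, and a sliding window is layered on top of the batch interval rather than replacing it.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("window-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))            // 5-second batches

    val lines = ssc.socketTextStream("localhost", 9999)         // placeholder source
    val keyed = lines.map(line => (line.split(",")(0), 1L))     // "group" by the first field

    // 60-second window sliding every 10 seconds, independent of the batch size
    val counts = keyed.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    counts.print()

    ssc.start()
    ssc.awaitTermination()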

Re: Read JSON in Dataframe and query

2016-04-11 Thread Ted Yu
Please take a look at sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala Cheers On Mon, Apr 11, 2016 at 12:13 PM, Radhakrishnan Iyer < radhakrishnan.i...@citiustech.com> wrote: > Hi all, > > > > I am new to Spark. > > I have a json in below format : > >

Re: Hello !

2016-04-11 Thread mylisttech
Thank you ! On Apr 12, 2016, at 1:41, Ted Yu wrote: > For SparkR, please refer to https://spark.apache.org/docs/latest/sparkr.html > > bq. on Ubuntu or CentOS > > Both platforms are supported. > > On Mon, Apr 11, 2016 at 1:08 PM, wrote: > Dear

Re: Hello !

2016-04-11 Thread Ted Yu
For SparkR, please refer to https://spark.apache.org/docs/latest/sparkr.html bq. on Ubuntu or CentOS Both platforms are supported. On Mon, Apr 11, 2016 at 1:08 PM, wrote: > Dear Experts , > > I am posting this for your information. I am a newbie to spark. > I am

Hello !

2016-04-11 Thread mylisttech
Dear Experts, I am posting this for your information. I am a newbie to spark. I am interested in understanding Spark at the internal level. I need your opinion: which unix flavor should I install spark on, Ubuntu or CentOS? I have had enough trouble with the windows version (1.6.1 with Hadoop

[ML] Training with bias

2016-04-11 Thread Daniel Siegmann
I'm trying to understand how I can add a bias when training in Spark. I have only a vague familiarity with this subject, so I hope this question will be clear enough. Using liblinear a bias can be set - if it's >= 0, there will be an additional weight appended in the model, and predicting with
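A small sketch under the assumption that liblinear's bias corresponds to the intercept term in Spark ML, which is controlled per estimator with setFitIntercept (shown here on logistic regression; the training DataFrame is a placeholder):

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setFitIntercept(true)    // train an intercept (bias) term in addition to the weights
      .setRegParam(0.01)

    // val model = lr.fit(trainingDF)   // trainingDF: DataFrame with label/features columns
    // model.intercept                  // the fitted bias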

How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-11 Thread Andrei
I'm working on a wrapper [1] around Spark for the Julia programming language [2] similar to PySpark. I've got it working with Spark Standalone server by creating local JVM and setting master programmatically. However, this approach doesn't work with YARN (and probably Mesos), which require running

Re: alter table add columns aternatives or hive refresh

2016-04-11 Thread Maurin Lenglart
I will try that during the next weekend. Thank you for your answers.

Read JSON in Dataframe and query

2016-04-11 Thread Radhakrishnan Iyer
Hi all, I am new to Spark. I have a json in the below format : Employee : {"id" : "001","name":"abc", "address" : [ {"state" : "maharashtra","country" : "india" } , {"state" : "gujarat","country" : "india" }]} I want to parse it in such a way that I can query the address. For e.g. get the list of states where
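A sketch of one way to query the nested address array (assuming each employee object sits on its own line of the input file; the path is a placeholder):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.explode

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("/path/to/employees.json")   // placeholder path

    // One row per (employee, address) pair
    val addresses = df.select(df("id"), explode(df("address")).as("addr"))

    // e.g. states where the country is india
    addresses.filter(addresses("addr.country") === "india")
             .select("addr.state")
             .distinct()
             .show()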

Max number of dataframes in Union

2016-04-11 Thread Darshan Singh
Hi, I would like to know if there is any max limit on a union of data-frames. How will the performance of, say, a 1 data frame union be in spark when all the data is in cache? The other option is 1 partitions of a single dataframe. Thanks
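A sketch of the usual way to combine a long list of frames in the 1.x API (the list itself is a placeholder); each unionAll grows the logical plan, which is where the practical limit tends to show up rather than in any documented maximum:

    import org.apache.spark.sql.DataFrame

    val parts: Seq[DataFrame] = Seq()   // placeholder: the frames to be combined

    // Fold the unions; in Spark 2.x the method is union instead of unionAll
    val combined = parts.reduce(_ unionAll _)
    combined.cache()
    // combined.count()   // materialize the cache if desired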

Re: [Streaming] textFileStream has no events shown in web UI

2016-04-11 Thread Yogesh Mahajan
Yes, this has been observed in my case also. The Input Rate is 0 even in the case of rawSocketStream. Is there a way we can enable the Input Rate for these types of streams? Thanks, http://www.snappydata.io/blog On Wed, Mar 16, 2016 at 4:21 PM, Hao Ren wrote: >

Re: How to estimate the size of dataframe using pyspark?

2016-04-11 Thread Davies Liu
That's weird, DataFrame.count() should not require lots of memory on the driver. Could you provide a way to reproduce it (you could generate a fake dataset)? On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev wrote: > I've allocated about 4g for the driver. For the count stage, I notice the >

Re: strange behavior of pyspark RDD zip

2016-04-11 Thread Davies Liu
It seems like a bug, could you file a JIRA for this? (also post a way to reproduce it) On Fri, Apr 1, 2016 at 11:08 AM, Sergey wrote: > Hi! > > I'm on Spark 1.6.1 in local mode on Windows. > > And have issue with zip of zip'pping of two RDDs of __equal__ size and > __equal__

Re: Sqoop on Spark

2016-04-11 Thread Jörn Franke
Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a csv on the database server which needs to be moved to HDFS. > On 11 Apr 2016, at 17:50, Michael Segel wrote: > > Depending on the Oracle

Re: Using spark.memory.useLegacyMode true does not yield expected behavior

2016-04-11 Thread Tom Hubregtsen
Solved: Call spark-submit with --driver-memory 512m --driver-java-options "-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" Thanks to: https://issues.apache.org/jira/browse/SPARK-14367

Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
Depending on the Oracle release… You could use webHDFS to gain access to the cluster and see the CSV file as an external table. However, you would need to have an application that will read each block of the file in parallel. This works for loading in to the RDBMS itself. Actually you

Re: Connection closed Exception.

2016-04-11 Thread Bijay Kumar Pathak
Hi Rodrick, I had tried increasing memory from 6G to 9G to 12G but still I am getting the same error. The size of dataframe I am trying to write is around 6-7 G and the Hive table is Parquet format. Thanks, Bijay On Mon, Apr 11, 2016 at 4:03 AM, Rodrick Brown

Re: Spark can't submit App to yarn

2016-04-11 Thread jiekechoo
submit to Standalone mode, it works ok.

Spark can't submit App to yarn

2016-04-11 Thread jiekechoo
Can anyone help me? Log Type: stderr Log Upload Time: 星期一 四月 11 19:55:35 +0800 2016 Log Length: 65816 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/10/spark-hdp-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J:

Spark App can't submit to YARN

2016-04-11 Thread Jieke Choo
Can anyone help me? Log Type: stderr Log Upload Time: 星期一 四月 11 19:55:35 +0800 2016 Log Length: 65816 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/10/spark-hdp-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J:

RE: Why Spark having OutOfMemory Exception?

2016-04-11 Thread Lohith Samaga M
Hi Kramer, Some options: 1. Store in Cassandra with TTL = 24 hours. When you read the full table, you get the latest 24 hours data. 2. Store in Hive as ORC file and use timestamp field to filter out the old data. 3. Try windowing in spark or flink (have not used

An error occurred while calling o199.parquet.

2016-04-11 Thread Malouke
hi, I get this error when I perform a join of two data frames under pyspark 1.5.0 and cloudera 5.5.1 in a yarn application. I have 15 nodes and can use 5 executors with 20g of memory. I work in IPYTHON. I have a small table and a large one: 6 GB of data for the large one and 30 MB for the small one. Py4JJavaError

Re: Connection closed Exception.

2016-04-11 Thread Rodrick Brown
Try increasing the memory allocated for this job. On Sun, Apr 10, 2016 at 9:12 PM -0700, "Bijay Kumar Pathak" wrote: Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: Read the 2 flat files, create the

Why Spark having OutOfMemory Exception?

2016-04-11 Thread kramer2...@126.com
I use spark to do some very simple calculation. The description is like below (pseudo code): While timestamp == 5 minutes df = read_hdf() # Read hdfs to get a dataframe every 5 minutes my_dict[timestamp] = df # Put the data frame into a dict delete_old_dataframe( my_dict )
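A rough Scala sketch of the pattern described above, with the detail that usually matters for memory: DataFrames that age out of the window must be unpersisted, otherwise their cached blocks stay on the executors (names and the 24-hour cutoff are placeholders):

    import scala.collection.mutable
    import org.apache.spark.sql.DataFrame

    val recent = mutable.LinkedHashMap[Long, DataFrame]()   // timestamp -> cached frame
    val dayMs = 24L * 60 * 60 * 1000

    def addBatch(ts: Long, df: DataFrame): Unit = {
      recent(ts) = df.cache()
      val expired = recent.keys.filter(_ < ts - dayMs).toSeq
      expired.foreach { old =>
        recent(old).unpersist()   // release the cached blocks, not just the reference
        recent.remove(old)
      }
    }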

Re: HiveContext unable to recognize the delimiter of Hive table in textfile partitioned by date

2016-04-11 Thread Shiva Achari
Hi All, In the above scenario, if the field delimiter is the Hive default then Spark is able to parse the data as expected; hence I believe this is a bug. Regards, Shiva Achari On Tue, Apr 5, 2016 at 8:15 PM, Shiva Achari wrote: > Hi, > > I have created a hive

[no subject]

2016-04-11 Thread Angel Angel
Hello, I am writing a spark application; it runs well but has a long execution time. Can anyone help me optimize my query to increase the processing speed? I am writing an application in which I have to construct a histogram and compare the histograms in order to find the final

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
To add a bit more detail perhaps something like this might work: package org.apache.spark.ml import org.apache.spark.ml.classification.RandomForestClassificationModel import org.apache.spark.ml.classification.DecisionTreeClassificationModel import

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
There are methods for converting the dataframe based random forest models to the old RDD based models and vice versa. Perhaps using these will help given that the old models can be saved and loaded? In order to use them however you will need to write code in the org.apache.spark.ml package. I've

ML Random Forest Classifier

2016-04-11 Thread Ashic Mahtab
Hello, I'm trying to save a pipeline with a random forest classifier. If I try to save the pipeline, it complains that the classifier is not Writable, and indeed the classifier itself doesn't have a write function. There's a pull request that's been merged that enables this for Spark 2.0 (any
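A sketch of the fallback the replies in this thread point at: the old RDD-based (mllib) random forest can be saved and reloaded in 1.6, unlike the ml Pipeline version (the data path and hyperparameters below are placeholders):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "/path/to/train.libsvm")   // placeholder path

    val model = RandomForest.trainClassifier(
      data, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100, featureSubsetStrategy = "auto", impurity = "gini",
      maxDepth = 5, maxBins = 32)

    model.save(sc, "/models/rf")                           // supported by the mllib model
    val reloaded = RandomForestModel.load(sc, "/models/rf")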

spark.driver.extraClassPath and export SPARK_CLASSPATH

2016-04-11 Thread AlexModestov
Hello, I've started to use Spark 1.6.1 before I used Spark 1.5. I included the string export SPARK_CLASSPATH="/SQLDrivers/sqljdbc_4.2/enu/sqljdbc41.jar" when I launched pysparkling and it worked well. But in version 1.6.1 there is an error that it's deprecated and I had to use

Re: Getting NPE when trying to do spark streaming with Twitter

2016-04-11 Thread Akhil Das
Looks like a yarn issue to me, Can you try checking out this code? https://github.com/akhld/sparkstreaming-twitter just git clone and do a sbt run after configuring your credentials in the main file

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-11 Thread ashesh_28
I have modified my yarn-site.xml to include the following properties: yarn.nodemanager.resource.memory-mb = 4096, yarn.scheduler.minimum-allocation-mb = 256, yarn.scheduler.maximum-allocation-mb = 2250

Spark demands HiveContext but I use only SqlContext

2016-04-11 Thread AlexModestov
Hello! I work with SqlContext; I create a query to MS Sql Server and get data... Spark says to me that I have to install hive... I have started to use Spark 1.6.1 (before, I used Spark 1.5 and never heard of this requirement)... Py4JJavaError: An error occurred while calling

Re: Spark not handling Null

2016-04-11 Thread Akhil Das
Surround it with a try..catch where it's complaining about the null pointer to avoid the job failing. What is happening here is that you are returning null and the following operation is working on that null, which causes the job to fail. Thanks Best Regards On Mon, Apr 11, 2016 at 12:51 PM,

Spark not handling Null

2016-04-11 Thread saurabh guru
Trying to run the following causes a NullPointerException. While I thought Spark should have been able to handle Null, apparently it is not able to. What could I return in place of null? What other ways could I approach this? There are times when I would want to just skip parsing and proceed to the next
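A small sketch of the "skip and move on" option asked about here: return an Option instead of null and let flatMap drop the failures (the Event shape and parsing are placeholders for the job's own logic):

    import scala.util.Try

    case class Event(user: String, ts: Long)

    def parseEvent(line: String): Option[Event] = Try {
      val parts = line.split("\t")
      Event(parts(0), parts(1).toLong)
    }.toOption   // None for malformed lines instead of null

    // val events = rawLines.flatMap(parseEvent)   // bad records are skipped, the job keeps going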

Re: alter table add columns aternatives or hive refresh

2016-04-11 Thread Mich Talebzadeh
This should work. Make sure that you use HiveContext.sql and sqlContext correctly. This is an example in Spark: reading a CSV file, doing some manipulation, creating a temp table, saving data as an ORC file, adding another column, and inserting values into the table in Hive with default values for new rows
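A rough sketch of that sequence (table, column and path names are placeholders; the CSV read assumes the spark-csv package is available):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    val df = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/input.csv")                              // placeholder path

    df.registerTempTable("staging")

    // Save as ORC via Hive, then evolve the schema
    hiveContext.sql("CREATE TABLE IF NOT EXISTS mydb.target STORED AS ORC AS SELECT * FROM staging")
    hiveContext.sql("ALTER TABLE mydb.target ADD COLUMNS (new_col STRING)")

    // New rows carry a value for the added column; pre-existing rows read it back as NULL
    hiveContext.sql("INSERT INTO TABLE mydb.target SELECT *, 'default_value' FROM staging")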