Will this affect the result of spark?

2016-05-11 Thread sunday2000
Hi, when running Spark in cluster mode, it occasionally gives this error message; will it affect the final result? ERROR CoarseGrainedExecutorBackend: Driver 192.168.1.1:45725 disassociated! Shutting down.

When starting spark-sql, PostgreSQL gives errors.

2016-05-11 Thread Joseph
Hi all, I use PostgreSQL to store the Hive metadata. First, I imported an SQL script into the metastore database as follows: psql -U postgres -d metastore -h 192.168.50.30 -f hive-schema-1.2.0.postgres.sql Then, when I started $SPARK_HOME/bin/spark-sql, PostgreSQL gave the following

Re: Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
Hi Simon, can you describe your problem in more detail? I suspect that my problem is caused by the window function (or maybe the groupBy agg functions). If yours is the same, maybe we should report a bug. At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]"

Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to correct myself again. It may still be a memory leak, because at the end the memory usage goes up again... eventually, the streaming program crashed.

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Telmo Rodrigues
I'm building Spark from the branch-1.6 source with mvn -DskipTests package and I'm running the following code in the spark shell: val sqlContext = new org.apache.spark.sql.SQLContext(sc); import sqlContext.implicits._; val df = sqlContext.read.json("persons.json"); val df2 =

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
You have kept the 3rd-party jars on HDFS. I don't think executors, as of today, can download jars from HDFS. Can you try with a shared directory? The application jar is downloaded by executors through an HTTP server. -Raghav On 12 May 2016 00:04, "Giri P" wrote: > Yes..They are

Re: How to transform a JSON string into a Java HashMap<> java.io.NotSerializableException

2016-05-11 Thread Marcelo Vanzin
Is the class mentioned in the exception below the parent class of the anonymous "Function" class you're creating? If so, you may need to make it serializable. Or make your function a proper "standalone" class (either a nested static class or a top-level one). On Wed, May 11, 2016 at 3:55 PM,
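A minimal Scala sketch of the pattern Marcelo describes (the thread itself is Java; the Jackson dependency and all names here are assumptions): keep the parsing logic in a standalone serializable object so the closure captures no non-serializable outer class.

    import com.fasterxml.jackson.databind.ObjectMapper
    import org.apache.spark.rdd.RDD
    import scala.collection.JavaConverters._

    // Standalone object: nothing from an enclosing class is captured, and the
    // ObjectMapper is recreated lazily on each executor rather than serialized.
    object JsonToMap extends Serializable {
      @transient private lazy val mapper = new ObjectMapper()
      def parse(json: String): Map[String, AnyRef] =
        mapper.readValue(json, classOf[java.util.Map[String, AnyRef]]).asScala.toMap
    }

    def toMaps(jsonStrings: RDD[String]): RDD[Map[String, AnyRef]] =
      jsonStrings.map(JsonToMap.parse)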

How to transform a JSON string into a Java HashMap<> java.io.NotSerializableException

2016-05-11 Thread Andy Davidson
I have a streaming app that receives very complicated JSON (Twitter statuses). I would like to work with it as a hash map. It would be very difficult to define a POJO for this JSON. (I cannot use twitter4j.) // map json string to map JavaRDD> jsonMapRDD =

Re: kryo

2016-05-11 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/q3RTtpO0qI3cp06/JodaDateTimeSerializer+spark=Re+NPE+when+using+Joda+DateTime On Wed, May 11, 2016 at 2:18 PM, Younes Naguib <younes.nag...@tritondigital.com> wrote: > Hi all, > > I'm trying to get to use spark.serializer. > I set it in the

kryo

2016-05-11 Thread Younes Naguib
Hi all, I'm trying to get spark.serializer to work. I set it in spark-defaults.conf, but I started getting issues with datetimes. As I understand, I need to disable it. Any way to keep using Kryo? It seems I can use JodaDateTimeSerializer for datetimes, just not sure how to set it, and
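For reference, a hedged sketch of one way to wire this up (assuming the de.javakaffee kryo-serializers artifact, which ships JodaDateTimeSerializer, is on the classpath): register the serializer through a custom KryoRegistrator and point spark.kryo.registrator at it.

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Teach Kryo how to handle Joda DateTime instances.
    class JodaKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer)
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "JodaKryoRegistrator") // use the fully-qualified class name

The same two settings can equally live in spark-defaults.conf instead of SparkConf.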

Re: apache spark on gitter?

2016-05-11 Thread Xinh Huynh
Hi Pawel, I'd like to hear more about your idea. Could you explain more why you would like to have a gitter channel? What are the advantages over a mailing list (like this one)? Have you had good experiences using gitter on other open source projects? Xinh On Wed, May 11, 2016 at 11:10 AM, Sean

Re: Setting Spark Worker Memory

2016-05-11 Thread Mich Talebzadeh
Run jps as below: jps 19724 SparkSubmit 10612 Worker, and do ps awx|grep PID for each number that represents these two descriptions. Something like ps awx|grep 30208: 30208 pts/2 Sl+ 1:05 /usr/java/latest/bin/java -cp

Re: Setting Spark Worker Memory

2016-05-11 Thread شجاع الرحمن بیگ
Yes, I'm running this in standalone mode. On Wed, May 11, 2016 at 6:23 PM, Mich Talebzadeh wrote: > are you running this in standalone mode? that is one physical host, and > the executor will live inside the driver. > > > > Dr Mich Talebzadeh > > > > LinkedIn * >

How to take executor memory dump

2016-05-11 Thread Nirav Patel
Hi, I am hitting OutOfMemoryError issues with Spark executors, mainly during shuffle. Executors get killed with OutOfMemoryError. I have tried setting spark.executor.extraJavaOptions to take a memory dump but it's not happening. spark.executor.extraJavaOptions = "-XX:+UseCompressedOops
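A sketch of the standard JVM flags for this (the dump path is hypothetical and must exist and be writable on every executor node); note that executor JVM options only take effect if they are set before the executors launch, e.g. in spark-defaults.conf or via --conf on spark-submit.

    import org.apache.spark.SparkConf

    // Ask each executor JVM to write a heap dump when it hits OutOfMemoryError.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor-dumps")

Alternatively, jmap -dump:format=b,file=heap.bin <pid> run on the executor's host takes a dump on demand.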

Is this possible to do in spark ?

2016-05-11 Thread Pradeep Nayak
Hi - I have a unique problem which I am trying to solve, and I am not sure if Spark would help here. I have a directory: /X/Y/a.txt and, in the same structure, /X/Y/Z/b.txt. a.txt contains a unique serial number, say 12345, and b.txt contains key-value pairs: a,1 b,1, c,0 etc. Every day you

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Giri P
Yes, they are reachable. The application jar, which I send as an argument, is at the same location as the third-party jar. The application jar is getting uploaded. On Wed, May 11, 2016 at 10:51 AM, lalit sharma wrote: > Point to note as per docs as well : > > *Note that jars or python

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Michael Armbrust
logical plan after optimizer execution:

    Project [id#0L,id#1L]
    +- Filter (id#0L = cast(1 as bigint))
       +- Join Inner, Some((id#0L = id#1L))
          :- Subquery t
          :  +- Relation[id#0L] JSONRelation
          +- Subquery u
             +- Relation[id#1L] JSONRelation

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Amit Sela
Somehow missed that ;) Anything about the Datasets slowness? On Wed, May 11, 2016, 21:02 Ted Yu wrote: > Which release are you using ? > > You can use the following to disable UI: > --conf spark.ui.enabled=false > > On Wed, May 11, 2016 at 10:59 AM, Amit Sela

Re: apache spark on gitter?

2016-05-11 Thread Sean Owen
I don't know of a gitter channel and I don't use it myself, FWIW. I think anyone's welcome to start one. I hesitate to recommend this, simply because it's preferable to have one place for discussion rather than split it over several, and we have to keep the @spark.apache.org mailing lists as the

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Ted Yu
Which release are you using? You can use the following to disable the UI: --conf spark.ui.enabled=false On Wed, May 11, 2016 at 10:59 AM, Amit Sela wrote: > I've ran a simple WordCount example with a very small List as > input lines and ran it in standalone (local[*]), and

Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Amit Sela
I've run a simple WordCount example with a very small List as input lines in standalone mode (local[*]), and Datasets are very slow: we're talking ~700 msec for RDDs while Datasets take ~3.5 sec. Is this just start-up overhead? Please note that I'm not timing the context creation... And
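For context, a minimal sketch of the kind of comparison being timed (input list and names made up; sc is an existing SparkContext; Spark 1.6 Dataset API):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val lines = List("apache spark", "spark datasets", "rdd versus datasets")

    // RDD word count.
    val rddCounts = sc.parallelize(lines)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()

    // Dataset word count (in 1.6, groupBy on a function returns a GroupedDataset).
    val dsCounts = sqlContext.createDataset(lines)
      .flatMap(_.split(" "))
      .groupBy(w => w)
      .count()
      .collect()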

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread lalit sharma
A point to note, as per the docs as well: "Note that jars or python files that are passed to spark-submit should be URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically upload local jars." http://spark.apache.org/docs/latest/running-on-mesos.html

Re: apache spark on gitter?

2016-05-11 Thread Paweł Szulc
No answer, but maybe one more time: a gitter channel for Spark users would be a good idea! On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc wrote: > Hi, > > I was wondering - why Spark does not have a gitter channel? > > -- > Regards, > Paul Szulc > > twitter: @rabbitonweb >

Re: Save DataFrame to HBase

2016-05-11 Thread Ted Yu
Please note: the name of the HBase table is specified in: def writeCatalog = s"""{ |"table":{"namespace":"default", "name":"table1"}, not by: HBaseTableCatalog.newTable -> "5" FYI On Tue, May 10, 2016 at 3:11 PM, Ted Yu wrote: > I think so. > >
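A hedged sketch of the full write path being referred to (assumes the Spark-HBase connector, shc; the DataFrame df and the column mappings are hypothetical). The table name comes from the catalog JSON; HBaseTableCatalog.newTable -> "5" only sets the number of regions when a new table is created.

    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    val writeCatalog = """{
      "table":   {"namespace":"default", "name":"table1"},
      "rowkey":  "key",
      "columns": {
        "col0": {"cf":"rowkey", "col":"key",  "type":"string"},
        "col1": {"cf":"cf1",    "col":"col1", "type":"string"}
      }
    }"""

    df.write
      .options(Map(
        HBaseTableCatalog.tableCatalog -> writeCatalog, // table name lives here
        HBaseTableCatalog.newTable     -> "5"))         // region count, not a name
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()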

Re: dataframe udf function will be executed twice when filtering on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In master branch, behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Giri P
I'm not using docker On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > By any chance, are you using docker to execute? > On 11 May 2016 21:16, "Raghavendra Pandey" > wrote: > >> On 11 May 2016 02:13, "gpatcham"

Re: Setting Spark Worker Memory

2016-05-11 Thread Mich Talebzadeh
Are you running this in standalone mode? That is, one physical host, and the executor will live inside the driver. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark on DSE Cassandra with multiple data centers

2016-05-11 Thread Simone Franzini
I am running Spark on DSE Cassandra with multiple analytics data centers. It is my understanding that with this setup you should have a CFS file system for each data center. I was able to create an additional CFS file system as described here:

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
By any chance, are you using docker to execute? On 11 May 2016 21:16, "Raghavendra Pandey" wrote: > On 11 May 2016 02:13, "gpatcham" wrote: > > > > > > Hi All, > > > > I'm using --jars option in spark-submit to send 3rd party jars . But I >

Re: Setting Spark Worker Memory

2016-05-11 Thread شجاع الرحمن بیگ
Yes. On Wed, May 11, 2016 at 5:43 PM, Deepak Sharma wrote: > Since you are registering workers from the same node, do you have enough > cores and RAM (in this case >= 9 cores and >= 24 GB) on this > node (11.14.224.24)? > > Thanks > Deepak > > On Wed, May 11, 2016 at 9:08

Re: Not able to pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
On 11 May 2016 02:13, "gpatcham" wrote: > > Hi All, > > I'm using the --jars option in spark-submit to send 3rd-party jars. But I don't > see that they are actually passed to the mesos slaves. Getting "no class found" > exceptions. > > This is how I'm using the --jars option > > --jars

Re: Setting Spark Worker Memory

2016-05-11 Thread Deepak Sharma
Since you are registering workers from the same node, do you have enough cores and RAM (in this case >= 9 cores and >= 24 GB) on this node (11.14.224.24)? Thanks Deepak On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ wrote: > Hi All, > > I need to set same memory and

Setting Spark Worker Memory

2016-05-11 Thread شجاع الرحمن بیگ
Hi All, I need to set the same memory and cores for each worker on the same machine, and for this purpose I have set the following properties in conf/spark-env.sh: export SPARK_EXECUTOR_INSTANCE=3 export SPARK_WORKER_CORES=3 export SPARK_WORKER_MEMORY=8g but only one worker is getting the desired memory and

Re: Spark 1.6.0: substring on df.select

2016-05-11 Thread Raghavendra Pandey
You can create a column with the count of "/". Then take the max of it and create that many columns for every row, with null fillers. Raghav On 11 May 2016 20:37, "Bharathi Raja" wrote: Hi, I have a dataframe column col1 with values something like

Spark 1.6.0: substring on df.select

2016-05-11 Thread Bharathi Raja
Hi, I have a dataframe column col1 with values something like “/client/service/version/method”. The number of “/” is not constant. Could you please help me extract all methods from the column col1? In Pig I used SUBSTRING with LAST_INDEX_OF(“/”). Thanks in advance. Regards, Raja
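One possible sketch (assuming the last path segment is the method, and a DataFrame df holding col1): the built-in substring_index, available since Spark 1.5, takes everything after the last "/" when given a negative count.

    import org.apache.spark.sql.functions.substring_index

    // "/client/service/version/method" -> "method": a count of -1 means
    // "the substring after the last occurrence of the delimiter".
    val withMethod = df.withColumn("method", substring_index(df("col1"), "/", -1))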

Re: dataframe udf function will be executed twice when filtering on new column created by withColumn

2016-05-11 Thread Ted Yu
In master branch, behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem about spark DataFrame. My spark version is 1.6.1. > Basically, i used udf and df.withColumn to create a

dataframe udf function will be executed twice when filtering on new column created by withColumn

2016-05-11 Thread Tony Jin
Hi guys, I have a problem with Spark DataFrames. My Spark version is 1.6.1. Basically, I used a udf and df.withColumn to create a "new" column, and then I filter the values on this new column and call show (an action). I see the udf function (which is used by withColumn to create the new column) is
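A minimal sketch reproducing the report (given a DataFrame df with an integer column a; all names made up): the side-effecting println shows the UDF firing more than once per row when the filter is on the derived column.

    import org.apache.spark.sql.functions.udf

    val plusOne = udf { x: Int =>
      println(s"udf invoked on $x") // prints twice per row if the UDF is re-executed
      x + 1
    }

    val df2 = df.withColumn("b", plusOne(df("a")))
    df2.filter(df2("b") > 1).show() // filtering on the new column triggers the issue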

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Rishi Mishra
Will try with the JSON relation, but with Spark's temp tables (Spark version 1.6) I get an optimized plan as you have mentioned. It should not be much different, though. Query: "select t1.col2, t1.col3 from t1, t2 where t1.col1=t2.col1 and t1.col3=7" Plan: Project [COL2#1,COL3#2] +- Join Inner,

Re: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Ted Yu
Looks like the exception was thrown from this line: ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader) Comment for taskBinary says: * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each * partition

Error: "Answer from Java side is empty"

2016-05-11 Thread AlexModestov
I use Sparkling Water 1.6.3 and Spark 1.6, with Java Oracle 8 or OpenJDK 7. Every time I transform a Spark DataFrame into an H2O DataFrame I get this error, and the Spark cluster dies: ERROR:py4j.java_gateway:Error while sending or receiving. Traceback (most recent call last): File

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Telmo Rodrigues
In this case, isn't it better to perform the filter as early as possible, even if there could be unhandled predicates? Telmo Rodrigues On 11/05/2016, at 09:49, Rishi Mishra wrote: > It does push the predicate. But as a relations are generic and might or might > not

java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Daniel Haviv
Hi, I'm running a very simple job (textFile -> map -> groupBy -> count) with Spark 1.6.0 on EMR 4.3 (Hadoop 2.7.1), and I'm hitting this exception when running on yarn-client but not in local mode: 16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1,

Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
After 8 hours the usage of memory became stable. Using the top command I find it is 75%, which means 12 GB of memory. But it still does not make sense, because my workload is very small: I use Spark to run a calculation on one CSV file every 20 seconds, and the size of the CSV file is 1.3 MB. So Spark

Re: [Spark 1.5.2] Check Foreign Key constraint

2016-05-11 Thread Ankit Singhal
You can use joins as a substitute for subqueries. On Wed, May 11, 2016 at 1:27 PM, Divya Gehlot wrote: > Hi, > I am using Spark 1.5.2 with Apache Phoenix 4.4 > As Spark 1.5.2 doesn't support subqueries in where conditions. >
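A minimal sketch of the join rewrite for a foreign-key check (table and column names hypothetical): child rows whose key has no match in the parent surface as nulls after a left outer join.

    // Orphaned child rows = child LEFT OUTER JOIN parent, keeping the rows
    // where the parent key came back null (i.e. the foreign key has no match).
    val orphans = child
      .join(parent, child("parent_id") === parent("id"), "left_outer")
      .filter(parent("id").isNull)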

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Rishi Mishra
It does push the predicate. But as relations are generic and might or might not handle some of the predicates, it needs to apply a filter for the un-handled predicates. Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Wed, May 11,
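A sketch of the mechanism (Spark 1.6 data source API; the relation itself is a stub): a BaseRelation reports which pushed-down predicates it could not fully evaluate via unhandledFilters, and Catalyst keeps exactly those in a Filter node above the scan.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources._
    import org.apache.spark.sql.types._

    class DemoRelation(override val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("col1", LongType), StructField("col3", IntegerType)))

      // Claim to handle only equality predicates; Spark re-applies the rest itself.
      override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
        filters.filterNot(_.isInstanceOf[EqualTo])

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row] // real data access elided
    }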

Re: [Spark 1.5.2] Check Foreign Key constraint

2016-05-11 Thread Alonso Isidoro Roman
I think Impala and Hive have this feature. Impala is faster than Hive; Hive is probably better to use in batch mode. Alonso Isidoro Roman https://about.me/alonso.isidoro.roman

Re: How to use pyspark streaming module "slice"?

2016-05-11 Thread sethirot
OK, thanks anyway. On Wed, May 11, 2016 at 12:15 AM, joyceye04 [via Apache Spark User List] <ml-node+s1001560n26919...@n3.nabble.com> wrote: > Not yet. And I turned to another way to bypass it just to finish my work. > Still waiting for answers :(

Use Collaborative Filtering and Clustering Algorithms in Spark MLlib

2016-05-11 Thread Imre Nagi
Hi All, I'm a newbie in Spark MLlib. In my office I have a statistician who works on improving our matrix model for our recommendation engine. However, he works in R. He told me that it's quite possible to combine collaborative filtering and latent Dirichlet allocation (LDA) by doing some
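Both building blocks do exist in MLlib. A hedged sketch of the two APIs side by side (toy data; Spark 1.6-era mllib package), with the actual combination left as the modeling question:

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Collaborative filtering via ALS on (user, product, rating) triples.
    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 5.0)))
    val cfModel = ALS.train(ratings, 10, 10, 0.01) // rank, iterations, lambda

    // LDA over a (docId, termCountVector) corpus.
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(1.0, 2.0, 0.0)),
      (1L, Vectors.dense(0.0, 3.0, 1.0))))
    val ldaModel = new LDA().setK(2).run(corpus)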

[Spark 1.5.2] Check Foreign Key constraint

2016-05-11 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Apache Phoenix 4.4. As Spark 1.5.2 doesn't support subqueries in where conditions (https://issues.apache.org/jira/browse/SPARK-4226), is there any alternative way to check foreign key constraints? Would really appreciate the help. Thanks, Divya

Re: Re: Re: Re: Re: Re: Re: How big the spark stream window could be?

2016-05-11 Thread Mich Talebzadeh
OK, you can see that process 10603, Worker, is running as the worker/slave, with its web UI on port 8081 (which you can access through the web) and connected to the master at spark://ES01:7077. Also, you have process 12420 running as SparkSubmit; that is telling you the JVM you have submitted for this

not able to write to cassandra table from spark

2016-05-11 Thread anandnilkal
I am trying to write incoming stream data to a database. Following is an example program: this code creates a thread to listen to an incoming stream of CSV data. The data needs to be split on a delimiter, and the array of data needs to be pushed to the database as separate columns in the
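A hedged sketch of that write path (assumes the DataStax spark-cassandra-connector; the DStream[String] called lines and the keyspace/table/column names are hypothetical): split each CSV line and save the fields as separate columns.

    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._

    lines
      .map(_.split(","))
      .map(f => (f(0), f(1), f(2)))                  // one tuple element per column
      .saveToCassandra("my_ks", "my_table", SomeColumns("col1", "col2", "col3"))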

Spark hanging forever when doing decision tree training

2016-05-11 Thread Loic Quertenmont
Hello, I am new to Spark and I am currently learning how to use classification algorithms with it. For now I am playing with a rather small dataset and training a decision tree on my laptop (running with --master local[1]). However, I systematically see that my jobs hang forever at the