Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Jörn Franke
Well it could also depend on the receiving database. You should also check the executors. Updating to the latest version of the JDBC driver, and to JDK 8 if supported by the JDBC driver, could help. > On 20 Apr 2016, at 00:14, Jonathan Gray wrote: > > Hi, > > I'm trying to
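
A minimal sketch of where such settings live, assuming Spark 1.6's DataFrameWriter.jdbc API; the URL, table name, and driver class below are illustrative, and df stands for the DataFrame being written:

  import java.util.Properties
  val props = new Properties()
  // Pin the (updated) JDBC driver class explicitly; class name is an assumption
  props.setProperty("driver", "org.postgresql.Driver")
  props.setProperty("user", "dbuser")
  props.setProperty("password", "dbpass")
  df.write.mode("append").jdbc("jdbc:postgresql://dbhost:5432/mydb", "target_table", props)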

Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Charles Nnamdi Akalugwu
I get the same error for fields which are not null unfortunately. Can't translate null value for field StructField(density,DecimalType(4,2),true) On Apr 21, 2016 1:37 AM, "Ted Yu" wrote: > The weight field is not nullable. > > Looks like your source table had null value for

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Takeshi Yamamuro
Sorry, I sent the previous message mid-composition by mistake. How about trying to increase the `batchsize` JDBC option to improve performance? // maropu On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro wrote: > Hi, > > How about trying to increase the `batchsize > > On Wed, Apr 20, 2016 at 7:14
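
A hedged sketch of passing batchsize through the connection properties (the option is documented for Spark's JDBC writer in later releases; whether 1.6.1 honors it is worth verifying, and url/df are assumed):

  val props = new java.util.Properties()
  // Rows per JDBC batch insert; larger batches mean fewer round trips to the database
  props.setProperty("batchsize", "10000")
  df.write.mode("append").jdbc(url, "target_table", props)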

Re: VectorAssembler handling null values

2016-04-20 Thread Koert Kuipers
Thanks for that, it's good to know that the functionality exists. But shouldn't a decision tree be able to handle missing (aka null) values more intelligently than simply using replacement values? See for example here:

Re: VectorAssembler handling null values

2016-04-20 Thread John Trengrove
You could handle null values by using the DataFrame.na functions in a preprocessing step like DataFrame.na.fill(). For reference: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions John On 21 April 2016 at 03:41, Andres Perez
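
A minimal Scala sketch of that preprocessing step, with illustrative column names and fill values (Spark 1.6's DataFrameNaFunctions.fill accepts a per-column Map):

  val filled = df.na.fill(Map(
    "optF1" -> "missing", // sentinel for a string column
    "optF2" -> 0L,        // sentinel for a numeric column
    "sparseF1" -> "none"
  ))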

bisecting kmeans tree

2016-04-20 Thread roni
Hi, I want to get the bisecting k-means tree structure to show on the heatmap I am generating based on the hierarchical clustering of the data. How do I get that using MLlib? Thanks -R

HBase Spark Module

2016-04-20 Thread Benjamin Kim
I see that the new CDH 5.7 has been released with the HBase Spark module built in. I was wondering if I could just download it and use the hbase-spark jar file with CDH 5.5. Has anyone tried this yet? Thanks, Ben

Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Ted Yu
The weight field is not nullable. Looks like your source table had a null value for this field. On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu < cprenzb...@gmail.com> wrote: > Hi, > > I am using spark 1.4.1 and trying to copy all rows from a table in one > MySQL Database to an Amazon RDS
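
Two workaround sketches for this situation (sourceDf and the weight column are taken from the thread; whether dropping rows is acceptable depends on the use case):

  // Drop rows that are null in the offending column before the JDBC write
  val clean = sourceDf.na.drop(Seq("weight"))
  // Or inspect which fields the source schema actually marks as nullable
  sourceDf.schema.fields.foreach(f => println(s"${f.name} nullable=${f.nullable}"))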

Re: spark on yarn

2016-04-20 Thread Mail.com
I get an error with a message stating the maximum number of cores allowed. > On Apr 20, 2016, at 11:21 AM, Shushant Arora > wrote: > > I am running a spark application on yarn cluster. > > say I have available vcores in cluster as 100. And I start spark

StructField Translation Error with Spark SQL

2016-04-20 Thread Charles Nnamdi Akalugwu
Hi, I am using Spark 1.4.1 and trying to copy all rows from a table in one MySQL database to an Amazon RDS table using Spark SQL. Some columns in the source table are defined as DECIMAL type and are nullable; others are not. When I run my spark job, val writeData =

Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Actually you are correct. It will be considered a non-logged operation, which they (the DBAs) probably won't allow in production. The only option for the thread owner is to perform smaller batches with frequent commits in MSSQL. Dr Mich Talebzadeh

RE: Spark SQL Transaction

2016-04-20 Thread Strange, Nick
Nologging means no redo log is generated (or minimal redo). However, undo is still generated, and the transaction will still be rolled back in the event of an issue. Nick From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Wednesday, April 20, 2016 4:08 PM To: Andrés Ivaldi Cc: user

Spark Streaming Job Question about retries and failover

2016-04-20 Thread map reduced
Hi, I have a simple Spark Streaming application which reads data from Kafka and then sends this data, after transformation, to an HTTP endpoint (or another Kafka topic; for this question let's consider HTTP). I am submitting jobs using job-server. I am

Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Well Oracle will allow that if the underlying table is in NOLOGGING mode :)
mtale...@mydb12.mich.LOCAL> create table testme(col1 int);
Table created.
mtale...@mydb12.mich.LOCAL> alter table testme NOLOGGING;
Table altered.
mtale...@mydb12.mich.LOCAL> insert into testme values(1);
1 row created.

Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
I think the same, and I don't think reducing the batch size improves speed, but it will avoid losing all the data on rollback. Thanks for the help. On Wed, Apr 20, 2016 at 4:03 PM, Mich Talebzadeh wrote: > yep. I think it is not possible to make SQL Server do a non

Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Yep. I think it is not possible to make SQL Server do a non-logged transaction. The other alternative is doing inserts in small batches if possible, or writing to a CSV-type file and using bulk copy to load the file into MSSQL with frequent commits, like every 50K rows? Dr Mich Talebzadeh
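
A rough sketch of the CSV-export half of that suggestion, assuming the com.databricks:spark-csv package available for Spark 1.x (the output path is illustrative; the bulk-copy load itself then happens outside Spark):

  df.write
    .format("com.databricks.spark.csv")
    .option("header", "false") // bcp/BULK INSERT usually expects no header row
    .save("hdfs:///staging/testme_csv")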

Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
Yes, I know that behavior, but there is no explicit BEGIN TRANSACTION in my code, so maybe Spark or the driver itself is adding the begin transaction, or an implicit transaction is configured. If Spark isn't adding a begin transaction on each insertion, then it is probably the database or the driver

Fwd: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
You will see what is happening in SQL Server. First create a test table called testme
1> use tempdb
2> go
1> create table testme(col1 int)
2> go
-- Now explicitly begin a transaction and insert 1 row and select from table
1> begin tran
2> insert into testme values(1)
3> select * from testme
4>

Append is not working with data frame

2016-04-20 Thread Anil Langote
Hi All, We are pulling the data from Oracle tables and writing it out as partitioned Parquet files. This is a daily process and it works fine until the 18th day (18 daily loads work fine); however, on the 19th day's load the DataFrame load process hangs and the load action is called more than once. If we remove

Unable to improve ListStatus performance of ParquetRelation

2016-04-20 Thread Ditesh Kumar
Hi, When creating a DataFrame from a partitioned file structure ( sqlContext.read.parquet("s3://bucket/path/to/partitioned/parquet/filles") ), it takes a lot of time to get the list of files recursively from S3 when a large number of files is involved. To circumvent this I wanted to override the

Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
Sorry I couldn't answer before. I want to know whether Spark is responsible for adding the BEGIN TRAN. The point is to trade the risk of losing data for insertion speed: disabling transactions will speed up the insertion, and we don't care about consistency... I'll disable the implicit_transaction and see what happens.

Re: VectorAssembler handling null values

2016-04-20 Thread Andres Perez
so the missing data could be on a one-off basis, or from fields that are in general optional, or from, say, a count that is only relevant for certain cases (very sparse):
f1|f2|f3|optF1|optF2|sparseF1
a|15|3.5|cat1|142L|
b|13|2.4|cat2|64L|catA
c|2|1.6|||
d|27|5.1||0|
-Andy On Wed, Apr 20, 2016

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Hi, you do not need to do anything with the RDD at all. Just follow the instructions on this site https://github.com/databricks/spark-csv and everything will be super fast and smooth. Remember that if the data is large, converting an RDD to a DataFrame takes a very, very long time.
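
For reference, the usual read pattern with that package looks roughly like this (path and options are illustrative):

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")      // first line holds column names
    .option("inferSchema", "true") // infer column types instead of all-string
    .load("hdfs:///data/input.csv")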

Re: Spark 2.0 forthcoming features

2016-04-20 Thread Michael Malak
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin From: Sourav Mazumder To: user Sent: Wednesday, April 20, 2016 11:07 AM Subject: Spark 2.0 forthcoming features Hi All, Is there

Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Found it. In case someone else is looking for this: cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen wrote: > Hi All, > > I am using the example of model selection via

Spark 2.0 forthcoming features

2016-04-20 Thread Sourav Mazumder
Hi All, Is there somewhere we can get an idea of the upcoming features in Spark 2.0? I found a list for Spark ML here: https://issues.apache.org/jira/browse/SPARK-12626. Are there other links where I can see similar enhancements planned for Spark SQL, Spark Core, Spark Streaming, GraphX etc.?

Re: Invoking SparkR from Spark shell

2016-04-20 Thread Ted Yu
Please take a look at: https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes On Wed, Apr 20, 2016 at 9:50 AM, Ashok Kumar wrote: > Hi, > > I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R > with Spark. > > Is there a shell
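
For the archive: Spark 1.6 also ships an R shell analogous to spark-shell; it is started with ./bin/sparkR from the Spark home directory.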

Invoking SparkR from Spark shell

2016-04-20 Thread Ashok Kumar
Hi, I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with Spark. Is there a shell similar to spark-shell that supports R besides Scala, please? Thanks

custom transformer pipeline sample code

2016-04-20 Thread Andy Davidson
Someone recently asked me for a code example of how to write a custom pipeline transformer in Java. Enjoy, Share Andy https://github.com/AEDWIP/Spark-Naive-Bayes-text-classification/blob/260a6b9b67d7da42c1d0f767417627da342c8a49/src/test/java/com/santacruzintegration/spa

Executor still on the UI even if the worker is dead

2016-04-20 Thread kundan kumar
Hi TD/Cody, Why does it happen in Spark Streaming that the executors are still shown on the UI even when the worker is killed and no longer in the cluster? This severely impacts my running jobs, which take too long, and the stages fail with the exception java.io.IOException: Failed to connect

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Wei Chen
Let's assume K is String, and V is Integer:
schema = StructType([StructField("K", StringType(), True), StructField("V", IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)
udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
df1 = df.select("K", udf1("V").alias("arrayV"))

Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Forgot to mention, I am using the Scala API of Spark 1.5.2. On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen wrote: > Hi All, > > I am using the example of model selection via cross-validation from the > documentation here: http://spark.apache.org/docs/latest/ml-guide.html. > After I

spark on yarn

2016-04-20 Thread Shushant Arora
I am running a spark application on a yarn cluster. Say I have 100 vcores available in the cluster, and I start the spark application with --num-executors 200 --executor-cores 2 (so I need 200*2 = 400 vcores in total), but only 100 are available in my cluster. What will happen? Will the job abort or will it be

Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Assuming that you are using JDBC to put data into any ACID-compliant database (MSSQL, Sybase, Oracle etc.), you are implicitly or explicitly adding BEGIN TRAN to the INSERT statement in a distributed transaction. MSSQL does not know or care where the data is coming from. If your connection completes

Re: pyspark split pair rdd to multiple

2016-04-20 Thread patcharee
I can also use a DataFrame. Any suggestions? Best, Patcharee On 20. april 2016 10:43, Gourav Sengupta wrote: Is there any reason why you are not using data frames? Regards, Gourav On Tue, Apr 19, 2016 at 8:51 PM, pth001 wrote:

Re: Reply: Spark sql and hive into different result with same sql

2016-04-20 Thread Ted Yu
Do you mind trying out a build from the master branch? 1.5.3 is a bit old. On Wed, Apr 20, 2016 at 5:25 AM, FangFang Chen wrote: > I found spark sql loses precision, and handles data as int with some rule. > Following is the data got via hive shell and spark sql, with the same sql

reading EOF exception while reading parquet file from hadoop

2016-04-20 Thread Naveen Kumar Pokala
Hi, I am trying to read a parquet file (for example one.parquet) and creating an RDD out of it. My program in Scala is like below:
val data = sqlContext.read.parquet("hdfs://machine:port/home/user/one.parquet").rdd.map { x => (x.getString(0), x) }
data.count()
I am using spark 1.4 and Hadoop

Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-20 Thread pierre lacave
Hi, I am trying to use Spark to write to a protected zone in HDFS. I am able to create and list files using the hdfs client, but when writing via Spark I get this exception. I could not find any mention of CryptoProtocolVersion in the Spark docs. Any idea what could have gone wrong? spark

Re: Exceeding spark.akka.frameSize when saving Word2VecModel

2016-04-20 Thread Stefan Falk
Nobody here who can help me with this? :/ On 19/04/16 13:15, Stefan Falk wrote: Hello Sparklings! I am trying to train a word vector model, but when I call Word2VecModel#save() I get an org.apache.spark.SparkException saying that this would exceed the frameSize limit (stackoverflow
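
A sketch of the commonly suggested workaround, raising the Akka frame size (in MB) before the context is created; the 512 value is an assumption, and the spark.akka.* settings were removed in later Spark versions:

  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf()
    .setAppName("word2vec-train")
    // Max control-plane message size in MB; must be set before the SparkContext starts
    .set("spark.akka.frameSize", "512")
  val sc = new SparkContext(conf)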

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Jörn Franke
Python can access the JVM - this is how it interfaces with Spark. Some of the components do not have a wrapper for the corresponding Java API yet and thus are not accessible in Python. The same goes for Elasticsearch; you need to write a more or less simple wrapper. > On 20 Apr 2016, at 09:53,

Re: Any NLP lib could be used on spark?

2016-04-20 Thread Chris Fregly
This took me a bit to get working, but I finally got it up and running with the package that Burak pointed out. Here are some relevant links to my project that should give you some clues:

Re: Spark support for Complex Event Processing (CEP)

2016-04-20 Thread Mario Ds Briggs
I did see your earlier post about Stratio Decision. Will read up on it, thanks. Mario From: Alonso Isidoro Roman To: Mich Talebzadeh Cc: Mario Ds Briggs/India/IBM@IBMIN, Luciano Resende, "user @spark"

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Is there any reason why you are not using data frames? Regards, Gourav On Tue, Apr 19, 2016 at 8:51 PM, pth001 wrote: > Hi, > > How can I split pair rdd [K, V] to map [K, Array(V)] efficiently in > Pyspark? > > Best, > Patcharee > >
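
For the record, the plain pair-RDD answer in Scala (the PySpark equivalent is groupByKey().mapValues(list); pairRdd below is an assumed RDD[(String, Int)], and note that groupByKey shuffles every value for a key to one place):

  import org.apache.spark.rdd.RDD
  // [K, V] pairs -> [K, Array(V)]
  val grouped: RDD[(String, Array[Int])] = pairRdd.groupByKey().mapValues(_.toArray)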

Re: Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread 李明伟
Hi, the input data size is less than 10 MB. The task result size should be less, I think, because I am doing aggregation on the data. At 2016-04-20 16:18:31, "Jeff Zhang" wrote: Do you mean the input data size as 10M or the task result size? >>> But my way is to setup

Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
Do you mean the input data size as 10M or the task result size? >>> But my way is to setup a forever loop to handle continued income data. Not sure if it is the right way to use spark Not sure what this means; do you use spark-streaming, or run a batch job in the forever loop? On Wed, Apr

Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread 李明伟
Hi Jeff, The total size of my data is less than 10 MB. I already set the driver memory to 4GB. At 2016-04-20 13:42:25, "Jeff Zhang" wrote: Seems it is OOM on the driver side when fetching task results. You can try to increase spark.driver.memory and
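
Note for readers of the archive: spark.driver.memory cannot be raised from inside an already-running driver; it has to be supplied at launch time, e.g. spark-submit --driver-memory 4g ..., or via spark-defaults.conf.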

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread kramer2...@126.com
I am using Python and Spark. I think one problem might be communicating between Spark and third-party products. For example, to combine Spark with Elasticsearch you have to use Java or Scala; Python is not supported.

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Zhang, Jingyu
GraphX does not support Python yet. http://spark.apache.org/docs/latest/graphx-programming-guide.html The workaround solution is to use graphframes (a 3rd-party API), https://issues.apache.org/jira/browse/SPARK-3789 but some features in Python are not the same as in Scala,
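
A minimal GraphFrames sketch in Scala (the API requires an "id" column on vertices and "src"/"dst" columns on edges; the data below is illustrative):

  import org.graphframes.GraphFrame
  // Vertices need an "id" column; edges need "src" and "dst"
  val v = sqlContext.createDataFrame(Seq(("a", "Alice"), ("b", "Bob"))).toDF("id", "name")
  val e = sqlContext.createDataFrame(Seq(("a", "b", "friend"))).toDF("src", "dst", "relationship")
  val g = GraphFrame(v, e)
  g.inDegrees.show()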

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread sujeet jog
It depends on the trade-offs you wish to make. Python being an interpreted language, its execution speed will be lower, but since it is a very commonly used language, people can jump in hands-on quickly. Scala programs run in the Java environment, so it's obvious you will get good execution speed,

Scala vs Python for Spark ecosystem

2016-04-20 Thread berkerkozan
I know Scala better than Python, but my team (2 other friends of mine) knows only Python. We want to use GraphX or maybe try graphframes. What will be the future of these 2 languages in the Spark ecosystem? Will Python cover everything Scala can in a short time period? What do you advise?

Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Are you using JDBC to push data to MSSQL? Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 19 April