Re: Does SparkSql thrift server support insert/update/delete sql statement

2016-03-28 Thread Raymond Honderdors
It should, depending on the storage used. I am facing a similar issue running Spark on EMR; I got EMR login errors for insert. Sent from Outlook Mobile On Mon, Mar 28, 2016 at 10:31 PM -0700, "sage" > wrote: Does SparkSql

Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-28 Thread Jacek Laskowski
Hi, How do you run the pipeline? Do you use assembly or package? Is this local, Spark standalone, or another cluster manager? What's the build configuration? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

overriding spark.streaming.blockQueueSize default value

2016-03-28 Thread Spark Newbie
Hi All, The default value for spark.streaming.blockQueueSize is 10 in https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala. In the Spark Kinesis ASL 1.4, the received Kinesis records are stored by calling addData on line 115

Does SparkSql thrift server support insert/update/delete sql statement

2016-03-28 Thread sage
Does the SparkSql thrift server support insert/update/delete SQL statements when connecting using JDBC? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-SparkSql-thrift-server-support-insert-update-delete-sql-statement-tp26618.html Sent from the Apache Spark

run spark job

2016-03-28 Thread Fei Hu
Hi, I am wondering how to run a Spark job with the java command, such as: java -cp spark.jar mainclass. When running/debugging the Spark program in IntelliJ IDEA, it uses the java command to run the Spark main class, so I think it should be possible to run the Spark job with the java command besides the

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
By the way, they can't be saved to BSON either. This seems to be a generic issue; can anyone else reproduce this? On Mon, Mar 28, 2016 at 8:02 PM, Russell Jurney wrote: > I created a JIRA: https://issues.apache.org/jira/browse/SPARK-14229 > > On Mon, Mar 28, 2016 at 7:43 PM,

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-14229 On Mon, Mar 28, 2016 at 7:43 PM, Russell Jurney wrote: > Ted, I am using the .rdd method, see above, but for some reason these RDDs > can't be saved to MongoDB or ElasticSearch. > > I think this is a

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
Ted, I am using the .rdd method, see above, but for some reason these RDDs can't be saved to MongoDB or Elasticsearch. I think this is a bug in PySpark/DataFrame. I can't think of another explanation... somehow DataFrame.rdd RDDs cannot be stored to an arbitrary Hadoop OutputFormat. When
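
One workaround sometimes suggested for this kind of symptom (not confirmed as the fix in this thread) is to turn the Row objects from DataFrame.rdd back into plain Python dicts before handing them to an OutputFormat. A minimal sketch, with made-up column names, assuming a pyspark shell where sc already exists:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`
    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # DataFrame.rdd yields pyspark.sql.Row objects; convert them to plain
    # dicts so downstream converters (e.g. BSON) see ordinary Python types.
    plain_rdd = df.rdd.map(lambda row: row.asDict())
    print(plain_rdd.collect())  # [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]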

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Ted Yu
See this method: lazy val rdd: RDD[T] = { On Mon, Mar 28, 2016 at 6:30 PM, Russell Jurney wrote: > Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. This > seems related to DataFrames. Is there a way to convert a DataFrame's RDD to > a 'normal'

Re: [SQL] Two columns in output vs one when joining DataFrames?

2016-03-28 Thread Divya Gehlot
Hi Jacek, The difference is mentioned in the Spark doc itself: Note that if you perform a self-join using this function without aliasing the input * [[DataFrame]]s, you will NOT be able to reference any columns after the join, since * there is no way to disambiguate which side of the join you
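
A minimal sketch of the aliasing workaround described above, with made-up column names, assuming a pyspark shell where sc already exists:

    from pyspark.sql import SQLContext
    from pyspark.sql.functions import col

    sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`
    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Alias each side of the self-join so columns stay addressable afterwards.
    left = df.alias("l")
    right = df.alias("r")
    joined = left.join(right, col("l.id") == col("r.id"))
    joined.select(col("l.value"), col("r.value")).show()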

Master options Cluster/Client discrepancies.

2016-03-28 Thread satyajit vegesna
Hi All, I have written a Spark program on my dev box, IDE: IntelliJ, Scala version: 2.11.7, Spark version: 1.6.1. It runs fine from the IDE, by providing proper input and output paths including master. But when I try to deploy the code on my cluster made of below, Spark

Unsubscribe

2016-03-28 Thread Andrew Heinrichs
On Mar 29, 2016 8:56 AM, "Alexander Krasnukhin" wrote: > e.g. select max value for column "foo": > > from pyspark.sql.functions import max, col > df.select(max(col("foo"))).show() > > On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson < > a...@santacruzintegration.com> wrote:

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. This seems related to DataFrames. Is there a way to convert a DataFrame's RDD to a 'normal' RDD? On Mon, Mar 28, 2016 at 6:20 PM, Russell Jurney wrote: > I filed a JIRA

pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-28 Thread Andy Davidson
I am using pyspark spark-1.6.1-bin-hadoop2.6 and python3. I have a data frame with a column I need to convert to a sparse vector. I get an exception. Any idea what my bug is? Kind regards Andy Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. :

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread charles li
Hi, Pal, thanks a lot, this can indeed help me. On Mon, Mar 28, 2016 at 10:44 PM, Sujit Pal wrote: > Hi Charles, > > I tried this with dummied out functions which just sum transformations of > a list of integers, maybe they could be replaced by algorithms in your > case.

PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Russell Jurney
I filed a JIRA in the mongo-hadoop project, but I'm curious if anyone else has seen this issue. Anyone have any idea what to do? I can't save to Mongo from PySpark. A contrived example works, but a dataframe does not. I activate pymongo_spark and load

Re: looking for an easy way to find the max value of a column in a data frame

2016-03-28 Thread Alexander Krasnukhin
e.g. select max value for column "foo": from pyspark.sql.functions import max, col df.select(max(col("foo"))).show() On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using pyspark 1.6.1 and python3. > > > *Given:* > > idDF2 = idDF.select(idDF.id,
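
The same approach, reformatted as a runnable snippet with an equivalent agg() form (the column name "foo" and values are illustrative; assumes a pyspark shell where sc already exists):

    from pyspark.sql import SQLContext
    from pyspark.sql.functions import max, col

    sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`
    df = sqlContext.createDataFrame([(1,), (7,), (3,)], ["foo"])

    df.select(max(col("foo"))).show()          # as in the reply above
    df.agg({"foo": "max"}).show()              # equivalent dictionary-style agg
    print(df.agg(max(col("foo"))).first()[0])  # 7, as a plain Python value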

looking for an easy way to find the max value of a column in a data frame

2016-03-28 Thread Andy Davidson
I am using pyspark 1.6.1 and python3. Given: idDF2 = idDF.select(idDF.id, idDF.col.id ) idDF2.printSchema() idDF2.show() root |-- id: string (nullable = true) |-- col[id]: long (nullable = true) +--+--+ |id| col[id]| +--+--+ |1008930924| 534494917|

Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-28 Thread Timothy Potter
I'm seeing the following error when trying to generate a prediction from a very simple ML pipeline based model. I've verified that the raw data sent to the tokenizer is valid (not null). It seems like this is some sort of weird classpath or class loading type issue. Any help you can provide in

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
You mean that once a job is in a waiting queue, it won't take advantage of additional workers that happened to be added after the job was put into the waiting queue? That would be less than optimal. But it would be OK with us for now as long as the additional workers will be taken advantage of by

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Mich Talebzadeh
The ACID test will come when you start two or more Spark processes simultaneously. If you see queuing (i.e. second job waiting for the first job to finish in Spark GUI) then you may not have enough resources for Yarn to accommodate two jobs despite the additional worker process. Dr Mich

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
Yea, that seems to be the case. It seems that dynamically resizing a standalone Spark cluster is very simple. Thanks! On Mon, Mar 28, 2016 at 10:22 PM, Mich Talebzadeh wrote: > start-all start the master and anything else in slaves file > start-master.sh starts the

Re: Unable to Limit UI to localhost interface

2016-03-28 Thread Mich Talebzadeh
In your /etc/hosts, what do you have for localhost? 127.0.0.1 localhost.localdomain localhost. conf/slaves should have one entry in your case: cat slaves # A Spark Worker will be started on each of the machines listed below. localhost ... Dr Mich Talebzadeh LinkedIn *

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Sunitha Kambhampati
Hi Jerry, I think you are running into an issue similar to SPARK-14040 https://issues.apache.org/jira/browse/SPARK-14040 One way to resolve it is to use alias. Here is an example that I tried on trunk and I do not see any exceptions.

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Mich Talebzadeh
Hi, I am using Spark 1.6.1 with Hive 2. I agree this may be a case to be resolved. I just happened to work around it. That first blank line causes val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header",
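
One possible workaround, purely as an illustration and not the approach taken in this thread: drop empty lines with a plain text read, then build the DataFrame from the cleaned lines. The path is a placeholder, the split is a naive CSV parse, and all columns come back as strings (assumes a pyspark shell where sc already exists):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`

    lines = sc.textFile("data/input.csv").filter(lambda l: l.strip() != "")
    header = lines.first()
    cols = header.split(",")

    rows = (lines.filter(lambda l: l != header)        # drop the header line
                 .map(lambda l: tuple(l.split(","))))  # naive CSV split
    df = sqlContext.createDataFrame(rows, cols)
    df.show()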

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Mich Talebzadeh
start-all.sh starts the master and anything else in the slaves file; start-master.sh starts the master only. I use start-slaves.sh for my purpose with added nodes in the slaves file. When you run start-slave.sh you are creating another worker process on the master host. You can check the status on the Spark

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Hyukjin Kwon
Could I ask which version you are using? It looks like the cause is the empty line right after the header (because that case is not being checked in tests). However, empty lines before the header or inside the data are being tested.

Re: This works to filter transactions older than certain months

2016-03-28 Thread Mich Talebzadeh
Forgot to mention Spark 1.6.1 Hive 2.0 Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 28 March 2016

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
It seems that the conf/slaves file is only for consumption by the following scripts: sbin/start-slaves.sh sbin/stop-slaves.sh sbin/start-all.sh sbin/stop-all.sh I.e., conf/slaves file doesn't affect a running cluster. Is this true? On Mon, Mar 28, 2016 at 9:31 PM, Sung Hwan Chung

[Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Hi spark users and developers, I'm using spark 1.5.1 (I have no choice because this is what we use). I ran into some very unexpected behaviour when I did some join operations lately. I cannot post my actual code here and the following code is not practical, but it should demonstrate

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
No, I didn't add it to the conf/slaves file. What I want to do is leverage auto-scaling on AWS, without needing to stop all the slaves (e.g. if a lot of slaves are idle, terminate those). Also, the book-keeping is easier if I don't have to deal with some centralized slave list that needs

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Ashok Kumar
Hello Mich, If you can accommodate it, can you please share your approach to steps 1-3 above. Best regards On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote: Pretty simple as usual it is a combination of ETL and ELT. Basically csv files are loaded into staging

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Mich Talebzadeh
Have you added the slave host name to $SPARK_HOME/conf? Then you can use start-slaves.sh or stop-slaves.sh for all instances The assumption is that slave boxes have $SPARK_HOME installed in the same directory as $SPARK_HOME is installed in the master. HTH Dr Mich Talebzadeh LinkedIn *

Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
Hello, I found that I could dynamically add/remove new workers to a running standalone Spark cluster by simply triggering: start-slave.sh (SPARK_MASTER_ADDR) and stop-slave.sh E.g., I could instantiate a new AWS instance and just add it to a running cluster without needing to add it to slaves

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted, What triggerAndWait does is perform a REST call to a specified URL and then wait until the status message returned by that URL, in a JSON field, says complete. The issue is I put a println at the very top of the method and that doesn't get printed out, and I know that println

Re: This works to filter transactions older than certain months

2016-03-28 Thread Mich Talebzadeh
Snippet. import org.apache.spark.sql.functions._ import java.sql.{Date, Timestamp} val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header",
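
The snippet above is Scala; a hedged PySpark rendering of the same idea, with an assumed column name ("payment_date") and an assumed 6-month cutoff (assumes a pyspark shell where sc already exists):

    from pyspark.sql import SQLContext
    from pyspark.sql.functions import add_months, current_date, col

    sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`
    df = sqlContext.createDataFrame(
        [("2015-01-10",), ("2016-03-01",)], ["payment_date"])

    # Keep only rows newer than six months before today; older rows are filtered out.
    cutoff = add_months(current_date(), -6)
    df.filter(col("payment_date").cast("date") >= cutoff).show()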

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Ted Yu
Can you describe what gets triggered by triggerAndWait ? Cheers On Mon, Mar 28, 2016 at 1:39 PM, kpeng1 wrote: > Hi All, > > I am currently trying to debug a spark application written in scala. I > have > a main method: > def main(args: Array[String]) { > ... >

println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread kpeng1
Hi All, I am currently trying to debug a spark application written in scala. I have a main method: def main(args: Array[String]) { ... SocialUtil.triggerAndWait(triggerUrl) ... The SocialUtil object is included in a separate jar. I launched the spark-submit command using --jars

Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
To answer my own question, DataFrame.toJSON() does this, so there is no need to map and json.dumps(): on_time_dataframe.toJSON().saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl') Thanks! On Mon, Mar 28, 2016 at 12:54 PM, Russell Jurney wrote: > In
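
A small sketch contrasting the two approaches in this thread: Row objects serialize as tuples, so json.dumps produces bare arrays, while DataFrame.toJSON() produces one JSON object per row. Data and output path are illustrative, and a pyspark shell with sc is assumed:

    import json
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`
    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    arrays = df.map(lambda row: json.dumps(row))  # '[1, "a"]'  -- unnamed fields
    objects = df.toJSON()                         # '{"id":1,"value":"a"}'
    print(arrays.take(1))
    print(objects.take(1))
    # objects.saveAsTextFile("out/data.jsonl")    # JSON Lines, as in the thread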

DataFrame --> JSON objects, instead of un-named array of fields

2016-03-28 Thread Russell Jurney
In PySpark, given a DataFrame, I am attempting to save it as JSON Lines/ndjson. I run this code: json_lines = on_time_dataframe.map(lambda x: json.dumps(x)) json_lines.saveAsTextFile('../data/On_Time_On_Time_Performance_2015.jsonl') This results in simple arrays of fields, instead of JSON

Re: This works to filter transactions older than certain months

2016-03-28 Thread Timur Shenkao
bq. CSV data is stored in an underlying table in Hive (actually created and populated as an ORC table by Spark) How is it possible? On Mon, Mar 28, 2016 at 1:50 AM, Mich Talebzadeh wrote: > Hi, > > A while back I was looking for functional programming to filter out >

NoSuchElementException in ChiSqSelector fit method (version 1.6.0)

2016-03-28 Thread jcason
I'm running into an error that's not making a lot of sense to me, and couldn't find sufficient info on the web to answer it myself. BTW, you can reply at Stack Overflow too: http://stackoverflow.com/questions/36254005/nosuchelementexception-in-chisqselector-fit-method-version-1-6-0 I've written

Re: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray error in newly built HBase

2016-03-28 Thread Ted Yu
Dropping dev@ Can you provide a bit more information? Release of HBase, release of Hadoop. I assume you're running on Linux. Any change in the Linux setup before the exception showed up? On Mon, Mar 28, 2016 at 10:30 AM, beeshma r wrote: > Hi > i am testing with newly build

RE: Non-classification neural networks

2016-03-28 Thread Ulanov, Alexander
Hi Jim, It is possible to use raw artificial neural networks by means of FeedForwardTrainer. It is [ml] package private, so your code should be in that package too. Basically, you need to do the same as it is done in MultilayerPerceptronClassifier but without encoding the output as one-hot:

Re: Aggregate subsequent x row values together.

2016-03-28 Thread sujeet jog
Hi Ted, There is no row key per se, and I actually do not want to sort; I want to aggregate the subsequent x rows together as a mean value, maintaining the order of the row entries. For ex: - Input rdd [ 12, 45 ] [ 14, 50 ] [ 10, 35 ] [ 11, 50 ] expected output rdd, the below is actually a
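
A hedged sketch of the "fake key" idea suggested in this thread: index each row with zipWithIndex, derive a group id for every block of x rows, then average each group. The values are the example input and x = 2, with a pyspark shell's sc assumed:

    x = 2
    rdd = sc.parallelize([[12, 45], [14, 50], [10, 35], [11, 50]])

    means = (rdd.zipWithIndex()                        # ([12, 45], 0), ([14, 50], 1), ...
                .map(lambda kv: (kv[1] // x, kv[0]))   # fake key: row_index // x
                .reduceByKey(lambda a, b: [u + v for u, v in zip(a, b)])
                .mapValues(lambda sums: [s / float(x) for s in sums])
                .sortByKey()
                .values())
    print(means.collect())  # [[13.0, 47.5], [10.5, 42.5]]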

Re: Aggregate subsequent x row values together.

2016-03-28 Thread Ted Yu
Can you describe your use case a bit more? Since the row keys are not sorted in your example, there is a chance that you get nondeterministic results when you aggregate on groups of two successive rows. Thanks On Mon, Mar 28, 2016 at 9:21 AM, sujeet jog wrote: > Hi, > >

Re: Aggregate subsequent x row values together.

2016-03-28 Thread Alexander Krasnukhin
So, why not make a fake key and aggregate on it? On Mon, Mar 28, 2016 at 6:21 PM, sujeet jog wrote: > Hi, > > I have a RDD like this . > > [ 12, 45 ] > [ 14, 50 ] > [ 10, 35 ] > [ 11, 50 ] > > i want to aggreate values of first two rows into 1 row and subsequenty the >

Aggregate subsequent x row values together.

2016-03-28 Thread sujeet jog
Hi, I have an RDD like this: [ 12, 45 ] [ 14, 50 ] [ 10, 35 ] [ 11, 50 ] I want to aggregate the values of the first two rows into 1 row, and subsequently the next two rows into another single row... I don't have a key to aggregate on for using some of the aggregate pyspark functions; how can I achieve it?

Re: --packages configuration equivalent item name?

2016-03-28 Thread Andy Davidson
Hi Russell I use Jupyter python notebooks a lot. Here is how I start the server set -x # turn debugging on #set +x # turn debugging off # https://github.com/databricks/spark-csv # http://spark-packages.org/package/datastax/spark-cassandra-connector

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread Sujit Pal
Hi Charles, I tried this with dummied out functions which just sum transformations of a list of integers, maybe they could be replaced by algorithms in your case. The idea is to call them through a "god" function that takes an additional type parameter and delegates out to the appropriate
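
A hedged sketch of that dispatcher idea with dummy algorithms (the names and bodies below are illustrative stand-ins, not the poster's actual algorithms): broadcast the data once, parallelize the algorithm names, and let each task dispatch to the right function. A pyspark shell's sc is assumed:

    data = list(range(1, 101))
    b_data = sc.broadcast(data)  # one shared copy of the dataset per executor

    def run_algorithm(name):
        """Dispatch to a dummy 'algorithm' by name and return (name, result)."""
        xs = b_data.value
        if name == "sum":
            return (name, sum(xs))
        if name == "mean":
            return (name, sum(xs) / float(len(xs)))
        if name == "max":
            return (name, max(xs))
        return (name, None)

    results = sc.parallelize(["sum", "mean", "max"]).map(run_algorithm).collect()
    print(results)  # [('sum', 5050), ('mean', 50.5), ('max', 100)]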

Unable to Limit UI to localhost interface

2016-03-28 Thread David O'Gwynn
Greetings to all, I've searched around the mailing list, but it would seem that (nearly?) everyone has the opposite problem to mine. I made a stab at looking in the source for an answer, but I figured I might as well see if anyone else has run into the same problem as I have. I'm trying to limit my

Re: Unit testing framework for Spark Jobs?

2016-03-28 Thread Steve Loughran
This is a good summary. Have you thought of publishing it at the end of a URL for others to refer to? > On 18 Mar 2016, at 07:05, Lars Albertsson wrote: > > I would recommend against writing unit tests for Spark programs, and > instead focus on integration tests of jobs or

Re: Custom RDD in spark, cannot find custom method

2016-03-28 Thread Ted Yu
You can run Zinc to speed up the build of Spark. Cheers > On Mar 27, 2016, at 10:15 PM, Tenghuan He wrote: > > Hi Ted > I changed > def customable(partitioner: Partitioner): RDD[(K, V)] = self.withScope { > to > def customable(partitioner: Partitioner): MyRDD[K, V] =

StackOverflow in updateStateByKey

2016-03-28 Thread Vikash Pareek
Hi, In my use case I need to maintain history data for a key. For this I am using updateStateByKey, in which state is maintained as a mutable scala collection (ArrayBuffer). Each element in the ArrayBuffer is an incoming record. Spark version is 1.6. As the number of elements (records) increases in the
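
One mitigation that is often suggested for deep-lineage stack overflows in stateful streaming (not confirmed as the fix for this particular case) is to checkpoint the stateful DStream so its lineage is truncated periodically. A minimal PySpark sketch with placeholder host, port, intervals, and checkpoint directory, assuming a pyspark shell's sc:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 10)             # 10-second batches (placeholder)
    ssc.checkpoint("hdfs:///tmp/checkpoints")  # required for stateful operations

    def update_history(new_values, history):
        # Append incoming records to the per-key history (mirrors the use case).
        return (history or []) + list(new_values)

    records = ssc.socketTextStream("localhost", 9999)   # placeholder source
    state = (records.map(lambda rec: (rec.split(",")[0], rec))
                    .updateStateByKey(update_history))
    state.checkpoint(60)   # checkpoint the stateful stream every 60 seconds
    state.pprint()

    ssc.start()
    ssc.awaitTermination()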

Re: Reading Back a Cached RDD

2016-03-28 Thread aka.fe2s
Nick, what is your use-case? On Thu, Mar 24, 2016 at 11:55 PM, Marco Colombo wrote: > You can persist off-heap, for example with tachyon, now called Alluxio. > Take a look at off heap peristance > > Regards > > > Il giovedì 24 marzo 2016, Holden Karau

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread Holden Karau
You probably want to look at the map transformation, and the many more defined on RDDs. The function you pass in to map is serialized and the computation is distributed. On Monday, March 28, 2016, charles li wrote: > > use case: have a dataset, and want to use different

since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread charles li
Use case: I have a dataset and want to use different algorithms on it, then fetch the results. To make this work, I think I should distribute my algorithms and run them on the dataset at the same time, am I right? But it seems that spark can not parallelize/serialize

Re: Twitter receiver not running in spark 1.6.0

2016-03-28 Thread UMESH CHAUDHARY
Could you post the code you are using for the streaming context? On Mon, Mar 28, 2016 at 10:31 AM, lokeshkumar wrote: > Hi forum > > For some reason if I include a twitter receiver and start the streaming > context, I get the below exception, not sure why > Can someone let