Best way to store Avro Objects as Parquet using SPARK

2016-03-20 Thread Manivannan Selvadurai
Hi All, In my current project there is a requirement to store Avro data (JSON format) as Parquet files. I was able to use AvroParquetWriter separately to create the Parquet files. Along with the data, the Parquet files also had the 'avro schema' stored in them as a part of their
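A minimal sketch of one DataFrame-based route (not taken from the thread), assuming the com.databricks:spark-avro package is on the classpath; the paths are placeholders, and note that this route stores the Parquet schema in the file footer rather than embedding the original Avro schema:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)            // sc: an existing SparkContext
    val avroDf = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/in/events.avro")                // placeholder input path
    avroDf.write.parquet("/data/out/events.parquet")  // placeholder output path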

Re: best practices: running multi user jupyter notebook server

2016-03-20 Thread charles li
Hi Andy, I think you can do that with some open-source packages/libs built for IPython and Spark. Here is one: https://github.com/litaotao/IPython-Dashboard On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > We are considering deploying a notebook

Re: Error using collectAsMap() in scala

2016-03-20 Thread Shishir Anshuman
I have stored the contents of two CSV files in separate RDDs. file1.csv format: (column1, column2, column3). file2.csv format: (column1, column2). column1 of file1 and column2 of file2 contain similar data. I want to compare the two columns and, if a match is found: - Replace the data at

Re: What is the most efficient and scalable way to get all the recommendation results from ALS model ?

2016-03-20 Thread Dhaval Modi
+1 On Mar 21, 2016 09:52, "Hiroyuki Yamada" wrote: > Could anyone give me some advice or recommendations or usual ways to do > this? > > I am trying to get all (probably top 100) product recommendations for each > user from a model (MatrixFactorizationModel), > but I

Re: Error using collectAsMap() in scala

2016-03-20 Thread Prem Sure
Any specific reason you would like to use collectAsMap only? You could probably move to a normal RDD instead of a pair RDD. On Monday, March 21, 2016, Mark Hamstra wrote: > You're not getting what Ted is telling you. Your `dict` is an RDD[String] > -- i.e. it is a collection of

Re: Error using collectAsMap() in scala

2016-03-20 Thread Mark Hamstra
You're not getting what Ted is telling you. Your `dict` is an RDD[String] -- i.e. it is a collection of a single value type, String. But `collectAsMap` is only defined for PairRDDs that have key-value pairs for their data elements. Both a key and a value are needed to collect into a Map[K, V].
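A hedged sketch of what this amounts to in practice for the case above (the delimiter and which column becomes the key are assumptions): turn the RDD[String] into an RDD of (key, value) pairs first, since collectAsMap is only available once the element type is a Tuple2.

    val dict = sc.textFile("file2.csv")              // placeholder path
    val dictMap: Map[String, String] = dict
      .map(_.split(","))
      .map(cols => (cols(1).trim, cols(0).trim))     // key on column2, keep column1 as value
      .collectAsMap()
      .toMap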

Re: What is the most efficient and scalable way to get all the recommendation results from ALS model ?

2016-03-20 Thread Hiroyuki Yamada
Could anyone give me some advice or recommendations on the usual ways to do this? I am trying to get all (probably the top 100) product recommendations for each user from a model (MatrixFactorizationModel), but I haven't figured out yet how to do it efficiently. So far, calling predict (predictAll in
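One option worth noting (a sketch; the model variable name and output path are assumptions): since Spark 1.4, MatrixFactorizationModel has a batch method that produces the top-N products for every user in a single distributed call, which avoids looping over predict per user.

    import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    val topN = 100
    // model: MatrixFactorizationModel (the trained ALS model, assumed)
    val recs: RDD[(Int, Array[Rating])] = model.recommendProductsForUsers(topN)
    recs.saveAsObjectFile("/tmp/recommendations")    // placeholder output path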

Re: Get the number of days dynamically in with Column

2016-03-20 Thread Silvio Fiorito
I’m not entirely sure if this is what you’re asking, but you could just use the datediff function: val df2 = df.withColumn("ID", datediff($"end", $"start")) If you want it formatted as {n}D then: val df2 = df.withColumn("ID", concat(datediff($"end", $"start"), lit("D"))) Thanks, Silvio From:
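A self-contained sketch of the same idea, with made-up data and explicit casts so it runs regardless of the input column types:

    import org.apache.spark.sql.functions.{concat, datediff, lit}
    import sqlContext.implicits._

    val df = sqlContext.createDataFrame(Seq(
      ("2016-03-01", "2016-03-03"),
      ("2016-03-01", "2016-03-21")
    )).toDF("start", "end")

    val df2 = df.withColumn("ID",
      concat(datediff($"end".cast("date"), $"start".cast("date")).cast("string"), lit("D")))
    df2.show()   // IDs come out as 2D and 20D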

Get the number of days dynamically in with Column

2016-03-20 Thread Divya Gehlot
I have a time-stamping table which has data like (No of Days, ID): (1, 1D), (2, 2D), and so on till 30 days. I have another DataFrame with a start date and an end date. I need to get the difference between these two dates, get the ID from the time-stamping table, and do withColumn.

Spark: The built-in indexes in ORC files do not work.

2016-03-20 Thread Joseph
Hi all, Has anyone used ORC indexes in Spark SQL? Does Spark SQL support ORC indexes completely? I use the shell script "${SPARK_HOME}/bin/spark-sql" to run the spark-sql REPL and execute my query statement. The following is my test in the spark-sql REPL: spark-sql>set
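Not a confirmed answer to this report, but one setting worth checking: in the Spark releases of this era, ORC predicate pushdown is disabled by default and is controlled by spark.sql.orc.filterPushdown; whether the file-level row-group indexes then actually help also depends on how the ORC files were written (e.g. whether the data is sorted on the filtered column).

    // Sketch: enable ORC predicate pushdown before running the query.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
    // Or equivalently from the spark-sql shell:
    //   SET spark.sql.orc.filterPushdown=true;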

Re: Building spark submodule source code

2016-03-20 Thread Ted Yu
To speed up the build process, take a look at install_zinc() in build/mvn, around line 83. And the following around line 137: # Now that zinc is ensured to be installed, check its status and, if its # not running or just installed, start it FYI On Sun, Mar 20, 2016 at 7:44 PM, Tenghuan He

Building spark submodule source code

2016-03-20 Thread Tenghuan He
Hi everyone, I am trying to add a new method to Spark RDD. After changing the code of RDD.scala and running the following command: mvn -pl :spark-core_2.10 -DskipTests clean install It builds successfully (BUILD SUCCESS); however, when starting bin\spark-shell, my method cannot be found. Do I have

support vector machine question

2016-03-20 Thread prem09
Hi, I created a dataset of 100 points, ranging from X=1.0 to X=100.0. I let the y variable be 0.0 if X < 51.0 and 1.0 otherwise. I then fit an SVMWithSGD. When I predict the y values for the same values of X as in the sample, I get back 1.0 for each predicted y! Incidentally, I don't get
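A hedged sketch of things worth trying (assumptions, not a confirmed diagnosis): with a single unscaled feature spanning 1.0 to 100.0 and no intercept, the default SVMWithSGD settings can easily collapse to predicting a single class. Fitting an intercept, standardizing the feature, and clearing the threshold to inspect the raw margins usually makes the behaviour easier to diagnose.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint

    // data: RDD[LabeledPoint] holding the 100 points described above (assumed)
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

    val svm = new SVMWithSGD()
    svm.setIntercept(true)
    svm.optimizer.setNumIterations(200).setRegParam(0.01)
    val model = svm.run(scaled)
    model.clearThreshold()   // predictions are now raw margins instead of 0/1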

Re: reading csv file, operation on column or columns

2016-03-20 Thread Mich Talebzadeh
Apologies, good point. def convertColumn(df: org.apache.spark.sql.DataFrame, name: String, newType: String) = { val df_1 = df.withColumnRenamed(name, "ConvertColumn"); df_1.withColumn(name, df_1.col("ConvertColumn").cast(newType)).drop("ConvertColumn") } val df_3 =

Re: reading csv file, operation on column or columns

2016-03-20 Thread Ted Yu
Mich: Looks like convertColumn() is a method of your own - I don't see it in the Spark code base. On Sun, Mar 20, 2016 at 3:38 PM, Mich Talebzadeh wrote: > Pretty straight forward as pointed out by Ted. > > --read csv file into a df > val df = >

Re: reading csv file, operation on column or columns

2016-03-20 Thread Mich Talebzadeh
Pretty straightforward, as pointed out by Ted. --read csv file into a df val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") scala> df.printSchema root |-- Invoice Number: string (nullable = true) |--

Re: reading csv file, operation on column or columns

2016-03-20 Thread Ted Yu
Please refer to the following methods of DataFrame: def withColumn(colName: String, col: Column): DataFrame = { def drop(colName: String): DataFrame = { On Sun, Mar 20, 2016 at 2:47 PM, Ashok Kumar wrote: > Gurus, > > I would like to read a csv file into a

reading csv file, operation on column or columns

2016-03-20 Thread Ashok Kumar
Gurus, I would like to read a CSV file into a DataFrame, but how can I rename a column, change a column type from String to Integer, or drop a column from further analysis before saving the data as a Parquet file? Thanks
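Pulling the replies above into one hedged sketch (paths and column names are placeholders), using the spark-csv package:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.IntegerType

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/stg/table2")                                             // placeholder path

    val cleaned = df
      .withColumnRenamed("Invoice Number", "invoice_number")                // rename a column
      .withColumn("invoice_number", col("invoice_number").cast(IntegerType)) // change its type
      .drop("Unwanted Column")                                              // placeholder column name

    cleaned.write.parquet("/data/stg/table2_parquet")                       // placeholder path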

Re: Flume with Spark Streaming Sink

2016-03-20 Thread Luciano Resende
You should use it as described in the documentation, passing it as a package: ./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.10:1.6.1 ... On Sun, Mar 20, 2016 at 9:22 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm trying to use the Spark
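Once the package is on the classpath, the polling receiver can be wired up roughly as below (host and port are placeholders, and the Flume agent must expose the org.apache.spark.streaming.flume.sink.SparkSink on that port):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9999)
    stream.map(e => new String(e.event.getBody.array(), "UTF-8")).print()
    ssc.start()
    ssc.awaitTermination()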

Re: Flume with Spark Streaming Sink

2016-03-20 Thread Ted Yu
$ jar tvf ./external/flume-sink/target/spark-streaming-flume-sink_2.10-1.6.1.jar | grep SparkFlumeProtocol 841 Thu Mar 03 11:09:36 PST 2016 org/apache/spark/streaming/flume/sink/SparkFlumeProtocol$Callback.class 2363 Thu Mar 03 11:09:36 PST 2016

Flume with Spark Streaming Sink

2016-03-20 Thread Daniel Haviv
Hi, I'm trying to use the Spark Sink with Flume but it seems I'm missing some of the dependencies. I'm running the following code: ./bin/spark-shell --master yarn --jars

Re: Spark 2.0 Shell -csv package weirdness

2016-03-20 Thread Marco Mistroni
Hi, I'll try the same settings as you tomorrow to see if I can reproduce the same issues. Will report back once done. Thanks On 20 Mar 2016 3:50 pm, "Vincent Ohprecio" wrote: > Thanks Mich and Marco for your help. I have created a ticket to look into > it on dev channel. > Here is the

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-20 Thread Surendra , Manchikanti
Hi, Can you check the Kafka topic replication? And the leader information? Regards, Surendra M -- Surendra Manchikanti On Thu, Mar 17, 2016 at 7:28 PM, Ascot Moss wrote: > Hi, > > I have a SparkStream (with Kafka) job, after running several days, it > failed with following

Re: Spark 2.0 Shell -csv package weirdness

2016-03-20 Thread Vincent Ohprecio
Thanks Mich and Marco for your help. I have created a ticket to look into it on dev channel. Here is the issue https://issues.apache.org/jira/browse/SPARK-14031 On Sun, Mar 20, 2016 at 2:57 AM, Mich Talebzadeh wrote: > Hi Vincent, > > I downloads the CSV file and did

Re: Limit pyspark.daemon threads

2016-03-20 Thread Ted Yu
I took a look at docs/configuration.md. Though I didn't find an answer to your first question, I think the following pertains to your second question: spark.python.worker.memory 512m Amount of memory to use per Python worker process during aggregation, in the same format as JVM

MLPC model can not be saved

2016-03-20 Thread HanPan
Hi Guys, I built an ML pipeline that includes a multilayer perceptron classifier, and I got the following error message when I tried to save the pipeline model. It seems like the MLPC model cannot be saved, which means I have no way to save the trained model. Is there any way to save the model
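For context, the MLPC model in the Spark releases of this era does not implement the ML persistence API, so there is no supported save(). One hedged workaround (an assumption, not an endorsed API) is plain Java serialization of the fitted PipelineModel, which only works if every stage in the pipeline happens to be serializable:

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.ml.PipelineModel

    // Sketch: persist/restore the fitted pipeline via Java serialization.
    def saveModel(model: PipelineModel, path: String): Unit = {
      val oos = new ObjectOutputStream(new FileOutputStream(path))
      try oos.writeObject(model) finally oos.close()
    }

    def loadModel(path: String): PipelineModel = {
      val ois = new ObjectInputStream(new FileInputStream(path))
      try ois.readObject().asInstanceOf[PipelineModel] finally ois.close()
    }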

Re: Stanford CoreNLP sentiment extraction: lost executor

2016-03-20 Thread tundo
SOLVED: Unfortunately, the stderr log in Hadoop's Resource Manager UI was not useful since it just reported "... Lost executor XX on workerYYY...". Therefore, I dumped the whole app-related logs locally: yarn logs -applicationId application_1458320004153_0343 >

Re: spark-submit reset JVM

2016-03-20 Thread Ted Yu
Not that I know of. Can you be a little more specific on which JVM(s) you want restarted (assuming spark-submit is used to start a second job) ? Thanks On Sun, Mar 20, 2016 at 6:20 AM, Udo Fholl wrote: > Hi all, > > Is there a way for spark-submit to restart the JVM in

ClassNotFoundException in RDD.map

2016-03-20 Thread Dirceu Semighini Filho
Hello, I found a strange behavior after executing a prediction with MLlib. My code returns an RDD[(Any, Double)] where Any is the ID of my dataset, which is a BigDecimal, and Double is the prediction for that line. When I run myRdd.take(10) it returns OK: res16: Array[_ >: (Double, Double) <: (Any,

spark-submit reset JVM

2016-03-20 Thread Udo Fholl
Hi all, Is there a way for spark-submit to restart the JVM in the worker machines? Thanks. Udo.

Re: Can't zip RDDs with unequal numbers of partitions

2016-03-20 Thread Jakob Odersky
Can you share a snippet that reproduces the error? What was spark.sql.autoBroadcastJoinThreshold before your last change? On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový wrote: > Hi, > > any idea what could be causing this issue? It started appearing after > changing

Re: Subquery performance

2016-03-20 Thread Michael Armbrust
If you encode the data in something like parquet we usually have more information and will try to broadcast. On Thu, Mar 17, 2016 at 7:27 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Anyways to cache the subquery or force a broadcast join without persisting > it? > > > > y > > >
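For reference, a small sketch of forcing a broadcast join without relying on the size threshold or on persisting (DataFrame and column names are placeholders):

    import org.apache.spark.sql.functions.broadcast

    // Explicitly mark the small side so the planner broadcasts it.
    val joined = largeDf.join(broadcast(smallDf), Seq("key"))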

[no subject]

2016-03-20 Thread Vinay Varma

Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-20 Thread craigiggy
Also, this is the command I use to submit the Spark application, where recommendation_engine-0.1-py2.7.egg is a Python egg of my own library I've written for this application, and 'file' and '/home/spark/enigma_analytics/tests/msg-epims0730_small.json' are input arguments for the

Re: Spark 2.0 Shell -csv package weirdness

2016-03-20 Thread Mich Talebzadeh
Hi Vincent, I downloaded the CSV file and did the test. Spark version 1.5.2. The full code is as follows, with minor changes to delete the yearAndCancelled.parquet and output.csv files if they have already been created. //$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 val HiveContext =

Re: spark launching range is 10 mins

2016-03-20 Thread Enrico Rotundo
You might wanna try to assign more cores to the driver?! Sent from my iPhone > On 20 Mar 2016, at 07:34, Jialin Liu wrote: > > Hi, > I have set the partitions as 6000, and requested 100 nodes, with 32 > cores each node, > and the number of executors is 32 per node > >

Re: ClassNotFoundException in RDD.map

2016-03-20 Thread Jakob Odersky
The error is very strange indeed, however without code that reproduces it, we can't really provide much help beyond speculation. One thing that stood out to me immediately is that you say you have an RDD of Any where every Any should be a BigDecimal, so why not specify that type information? When
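A sketch of what "specify that type information" could look like here (which BigDecimal flavour, java.math vs scala.math, is an assumption):

    import org.apache.spark.rdd.RDD

    // myRdd: RDD[(Any, Double)] as described in the original post (assumed)
    val typed: RDD[(BigDecimal, Double)] = myRdd.map {
      case (id: java.math.BigDecimal, score: Double) => (BigDecimal(id), score)
    }
    typed.take(10).foreach(println)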

Re: best way to do deep learning on spark ?

2016-03-20 Thread James Hammerton
In the meantime there is also deeplearning4j which integrates with Spark (for both Java and Scala): http://deeplearning4j.org/ Regards, James On 17 March 2016 at 02:32, Ulanov, Alexander wrote: > Hi Charles, > > > > There is an implementation of multilayer perceptron

Re: df.dtypes -> pyspark.sql.types

2016-03-20 Thread Reynold Xin
We probably should have the alias. Is this still a problem on master branch? On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov wrote: > Running following: > > #fix schema for gaid which should not be Double >> from pyspark.sql.types import * >> customSchema = StructType()

Restarting an executor during execution causes it to lose AWS credentials (anyone seen this?)

2016-03-20 Thread Allen George
Hi guys, I'm having a problem where respawning a failed executor during a job that reads/writes parquet on S3 causes subsequent tasks to fail because of missing AWS keys. Setup: I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple standalone cluster: 1 master 2 workers My
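One workaround that has helped in similar setups (an assumption, not a confirmed fix for this report) is to put the keys into the Hadoop configuration explicitly rather than relying on environment variables that a respawned executor may not inherit; the property names below are the s3n ones for Hadoop 2.7, so adjust for s3a if that is the scheme in use:

    // Sketch: propagate AWS credentials through the Hadoop configuration.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))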

Re: spark launching range is 10 mins

2016-03-20 Thread Jialin Liu
Hi, I have set the partitions to 6000 and requested 100 nodes, with 32 cores each node, and the number of executors is 32 per node: spark-submit --master $SPARKURL --executor-cores 32 --driver-memory 20G --executor-memory 80G single-file-test.py And I'm reading a 2.2 TB file; the code just has

bug spark should not use java.sql.timestamp was: sql timestamp timezone bug

2016-03-20 Thread Andy Davidson
Here is a nice analysis of the issue from the Cassandra mailing list. (Datastax is the Databricks for Cassandra.) Should I file a bug? Kind regards Andy http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date and this one On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer
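A minimal illustration of the class of problem under discussion (made-up values, not taken from the thread): java.sql.Timestamp stores an instant, but its string form depends on the JVM default time zone, so the same value renders differently on machines in different zones.

    import java.sql.Timestamp
    import java.util.TimeZone

    val ts = new Timestamp(0L)                                        // the epoch instant
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
    println(ts)                                                       // 1970-01-01 00:00:00.0
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
    println(ts)                                                       // 1969-12-31 16:00:00.0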