Spark Streaming: Combine MLlib Prediction and Features on Dstreams

2016-05-26 Thread obaidul karim
Hi Guys,

This is my first mail to spark users mailing list.

I need help on Dstream operation.

In fact, I am using an MLlib random forest model to predict using Spark
Streaming. In the end, I want to combine the feature DStream & prediction
DStream together for further downstream processing.

I am predicting using the piece of code below:

predictions = texts.map(lambda x: getFeatures(x)) \
                   .map(lambda x: x.split(',')) \
                   .map(lambda parts: [float(i) for i in parts]) \
                   .transform(lambda rdd: rf_model.predict(rdd))

Here, texts is a DStream whose records are single lines of text, and
getFeatures generates comma-separated features extracted from each record.


I want the output as below tuple:
("predicted value", "original text")

How can I achieve that ?
or
at least, can I perform .zip on two DStreams like a normal RDD operation, as
below:
output = texts.zip(predictions)
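
For reference, an untested sketch of what I have in mind, assuming DStream.transformWith
is available in this PySpark version and that rf_model.predict keeps the partitioning
aligned so the zip matches up:

# Sketch only: pair each prediction with its original text line.
# Assumes texts, getFeatures and rf_model are defined as above.
def predict_and_pair(feature_rdd, text_rdd):
    preds = rf_model.predict(feature_rdd)   # RDD of predicted values
    return preds.zip(text_rdd)              # -> (predicted value, original text)

features = texts.map(getFeatures) \
                .map(lambda line: [float(v) for v in line.split(',')])

output = features.transformWith(predict_and_pair, texts)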


Thanks in advance.

-Obaid


Sample scala program using CMU Sphinx4

2016-05-26 Thread Vajra L
Folks,
Does anyone have a sample program for Speech recognition in CMU Sphinx4 on 
Spark Scala?

Thanks.
Vajra



Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
yeah that could work, since i should know (or be able to find out) all the
input columns

On Thu, May 26, 2016 at 11:30 PM, Takeshi Yamamuro 
wrote:

> You couldn't do like this?
>
> --
> val func = udf((i: Int) => Tuple2(i, i))
> val df = Seq((1, ..., 0), (2, ..., 5)).toDF("input", "c0", "c1", other
> needed columns, "cX")
> df.select(func($"a").as("r"), $"c0", $"c1", $"cX").select($"r._1",
> $"r._2", $"c0", $"c1", $"cX")
>
> // maropu
>
>
> On Fri, May 27, 2016 at 12:15 PM, Koert Kuipers  wrote:
>
>> yes, but i also need all the columns (plus of course the 2 new ones) in
>> my output. your select operation drops all the input columns.
>> best, koert
>>
>> On Thu, May 26, 2016 at 11:02 PM, Takeshi Yamamuro > > wrote:
>>
>>> Couldn't you include all the needed columns in your input dataframe?
>>>
>>> // maropu
>>>
>>> On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers 
>>> wrote:
>>>
 that is nice and compact, but it does not add the columns to an
 existing dataframe

 On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro <
 linguin@gmail.com> wrote:

> Hi,
>
> How about this?
> --
> val func = udf((i: Int) => Tuple2(i, i))
> val df = Seq((1, 0), (2, 5)).toDF("a", "b")
> df.select(func($"a").as("r")).select($"r._1", $"r._2")
>
> // maropu
>
>
> On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers 
> wrote:
>
>> hello all,
>>
>> i have a single udf that creates 2 outputs (so a tuple 2). i would
>> like to add these 2 columns to my dataframe.
>>
>> my current solution is along these lines:
>> df
>>   .withColumn("_temp_", udf(inputColumns))
>>   .withColumn("x", col("_temp_)("_1"))
>>   .withColumn("y", col("_temp_")("_2"))
>>   .drop("_temp_")
>>
>> this works, but its not pretty with the temporary field stuff.
>>
>> i also tried this:
>> val tmp = udf(inputColumns)
>> df
>>   .withColumn("x", tmp("_1"))
>>   .withColumn("y", tmp("_2"))
>>
>> this also works, but unfortunately the udf is evaluated twice
>>
>> is there a better way to do this?
>>
>> thanks! koert
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
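
One way to combine the two suggestions above, evaluating the UDF only once while
keeping every existing column without listing them by hand, is sketched below. This is
untested, and whether Catalyst collapses the projections (and so re-evaluates the UDF)
may vary by Spark version:

// Sketch only: assumes import sqlContext.implicits._ is in scope.
import org.apache.spark.sql.functions.{col, udf}

val func = udf((i: Int) => (i, i))
val df = Seq((1, 10), (2, 20)).toDF("a", "b")

val withTuple = df.select(col("*"), func(col("a")).as("r"))  // UDF applied once here
val result = withTuple
  .withColumn("x", col("r")("_1"))   // pull the struct fields out...
  .withColumn("y", col("r")("_2"))
  .drop("r")                         // ...then drop the temporary struct column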


Re: Spark input size when filtering on parquet files

2016-05-26 Thread Takeshi Yamamuro
Hi,

Spark just prints the number of bytes in the web UI, accumulated from
InputSplit#getLength (which is simply the length of the files).
Therefore, I'm afraid this metric does not reflect the actual bytes read for
Parquet.
If you need that metric, you have to use other tools such as iostat or
something similar.

// maropu


On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker 
wrote:

> Hi all
>
>
>
> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to
> find out about the improvements made in filtering/scanning parquet files
> when querying for tables using SparkSQL and how these changes relate to the
> new filter API introduced in Parquet 1.7.0.
>
>
>
> After checking the usual sources, I still can’t make sense of some of the
> numbers shown on the Spark UI. As an example, I’m looking at the collect
> stage for a query that’s selecting a single row from a table containing 1
> million numbers using a simple where clause (i.e. col1 = 50) and this
> is what I see on the UI:
>
>
>
> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>
> 1 SUCCESS ... 2.4 MB (hadoop) / 25
>
> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>
> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>
>
>
> Based on the min/max statistics of each of the parquet parts, it makes
> sense not to expect any records for 3 out of the 4, because the record I’m
> looking for can only be in a single file. But why is the input size above
> shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
> whole stage? Isn't it just meant to read the metadata and ignore the
> content of the file?
>
>
>
> Regards,
>
> Dennis
>



-- 
---
Takeshi Yamamuro


Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Ted Yu
Priya:
Have you checked the executor logs on hostname1 and hostname2 ?

Cheers

On Thu, May 26, 2016 at 8:00 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> If your job gets stuck or fails, one of the best practices is to increase
> the number of partitions.
> Also, you'd be better off using DataFrames instead of RDDs in terms of join
> optimization.
>
> // maropu
>
>
> On Thu, May 26, 2016 at 11:40 PM, Priya Ch 
> wrote:
>
>> Hello Team,
>>
>>
>>  I am trying to join 2 RDDs where one is of size 800 MB and the
>> other is 190 MB. During the join step, my job halts and I don't see
>> progress in the execution.
>>
>> This is the message I see on console -
>>
>> INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
>> locations for shuffle 0 to :4
>> INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
>> locations for shuffle 1 to :4
>>
>> After these messages, I don't see any progress. I am using Spark version
>> 1.6.0 and the YARN scheduler (running in YARN client mode). My cluster
>> configuration is a 3-node cluster (1 master and 2 slaves). Each slave has
>> 1 TB of hard disk space, 300 GB of memory and 32 cores.
>>
>> HDFS block size is 128 MB.
>>
>> Thanks,
>> Padma Ch
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
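
As a rough illustration of that advice (untested sketch; the RDD names, column names,
partition count, and the broadcast hint are assumptions, not from the original post):

// Sketch only: bigRdd/smallRdd stand for the two RDDs being joined;
// assumes import sqlContext.implicits._ is in scope for toDF.
import org.apache.spark.sql.functions.broadcast

val bigDF   = bigRdd.toDF("key", "value1").repartition(400)  // tune to your cluster
val smallDF = smallRdd.toDF("key", "value2")

// The 190 MB side may be small enough to broadcast, avoiding a shuffle of the big side.
val joined = bigDF.join(broadcast(smallDF), "key")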


Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
yes, but i also need all the columns (plus of course the 2 new ones) in my
output. your select operation drops all the input columns.
best, koert

On Thu, May 26, 2016 at 11:02 PM, Takeshi Yamamuro 
wrote:

> Couldn't you include all the needed columns in your input dataframe?
>
> // maropu
>
> On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers  wrote:
>
>> that is nice and compact, but it does not add the columns to an existing
>> dataframe
>>
>> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro > > wrote:
>>
>>> Hi,
>>>
>>> How about this?
>>> --
>>> val func = udf((i: Int) => Tuple2(i, i))
>>> val df = Seq((1, 0), (2, 5)).toDF("a", "b")
>>> df.select(func($"a").as("r")).select($"r._1", $"r._2")
>>>
>>> // maropu
>>>
>>>
>>> On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers 
>>> wrote:
>>>
 hello all,

 i have a single udf that creates 2 outputs (so a tuple 2). i would like
 to add these 2 columns to my dataframe.

 my current solution is along these lines:
 df
   .withColumn("_temp_", udf(inputColumns))
   .withColumn("x", col("_temp_)("_1"))
   .withColumn("y", col("_temp_")("_2"))
   .drop("_temp_")

 this works, but its not pretty with the temporary field stuff.

 i also tried this:
 val tmp = udf(inputColumns)
 df
   .withColumn("x", tmp("_1"))
   .withColumn("y", tmp("_2"))

 this also works, but unfortunately the udf is evaluated twice

 is there a better way to do this?

 thanks! koert

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Takeshi Yamamuro
Couldn't you include all the needed columns in your input dataframe?

// maropu

On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers  wrote:

> that is nice and compact, but it does not add the columns to an existing
> dataframe
>
> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> How about this?
>> --
>> val func = udf((i: Int) => Tuple2(i, i))
>> val df = Seq((1, 0), (2, 5)).toDF("a", "b")
>> df.select(func($"a").as("r")).select($"r._1", $"r._2")
>>
>> // maropu
>>
>>
>> On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers  wrote:
>>
>>> hello all,
>>>
>>> i have a single udf that creates 2 outputs (so a tuple 2). i would like
>>> to add these 2 columns to my dataframe.
>>>
>>> my current solution is along these lines:
>>> df
>>>   .withColumn("_temp_", udf(inputColumns))
>>>   .withColumn("x", col("_temp_)("_1"))
>>>   .withColumn("y", col("_temp_")("_2"))
>>>   .drop("_temp_")
>>>
>>> this works, but its not pretty with the temporary field stuff.
>>>
>>> i also tried this:
>>> val tmp = udf(inputColumns)
>>> df
>>>   .withColumn("x", tmp("_1"))
>>>   .withColumn("y", tmp("_2"))
>>>
>>> this also works, but unfortunately the udf is evaluated twice
>>>
>>> is there a better way to do this?
>>>
>>> thanks! koert
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Takeshi Yamamuro
Hi,

If your job gets stuck or fails, one of the best practices is to increase
the number of partitions.
Also, you'd be better off using DataFrames instead of RDDs in terms of join
optimization.

// maropu


On Thu, May 26, 2016 at 11:40 PM, Priya Ch 
wrote:

> Hello Team,
>
>
>  I am trying to join 2 RDDs where one is of size 800 MB and the
> other is 190 MB. During the join step, my job halts and I don't see
> progress in the execution.
>
> This is the message I see on console -
>
> INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
> locations for shuffle 0 to :4
> INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
> locations for shuffle 1 to :4
>
> After these messages, I don't see any progress. I am using Spark version
> 1.6.0 and the YARN scheduler (running in YARN client mode). My cluster
> configuration is a 3-node cluster (1 master and 2 slaves). Each slave has
> 1 TB of hard disk space, 300 GB of memory and 32 cores.
>
> HDFS block size is 128 MB.
>
> Thanks,
> Padma Ch
>



-- 
---
Takeshi Yamamuro


RE: Not able to write output to local filsystem from Standalone mode.

2016-05-26 Thread Yong Zhang
That just makes sense, doesn't it?
The only place it can be is the driver. If not, the executors would be contending over
which of them should create the directory in this case.
Only the coordinator (the driver in this case) is the best place for doing it.
Yong

From: math...@closetwork.org
Date: Wed, 25 May 2016 18:23:02 +
Subject: Re: Not able to write output to local filsystem from Standalone mode.
To: ja...@japila.pl
CC: stutiawas...@hcl.com; user@spark.apache.org

Experience. I don't use Mesos or Yarn or Hadoop, so I don't know.

On Wed, May 25, 2016 at 2:51 AM Jacek Laskowski  wrote:
Hi Mathieu,



Thanks a lot for the answer! I did *not* know it's the driver to

create the directory.



You said "standalone mode", is this the case for the other modes -

yarn and mesos?



p.s. Did you find it in the code or...just experienced before? #curious



Pozdrawiam,

Jacek Laskowski



https://medium.com/@jaceklaskowski/

Mastering Apache Spark http://bit.ly/mastering-apache-spark

Follow me at https://twitter.com/jaceklaskowski





On Tue, May 24, 2016 at 4:04 PM, Mathieu Longtin  wrote:

> In standalone mode, executors assume they have access to a shared file

> system. The driver creates the directory and the executors write the files, so

> the executors end up not writing anything since there is no local directory.

>

> On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi  wrote:

>>

>> hi Jacek,

>>

>> The parent directory is already present; it's my home directory. I'm using a Linux

>> (Red Hat) machine, 64-bit.

>> Also, I noticed that a "test1" folder is created on my master with a

>> subdirectory "_temporary" which is empty, but on the slaves no such

>> directory is created under /home/stuti.

>>

>> Thanks

>> Stuti

>> 

>> From: Jacek Laskowski [ja...@japila.pl]

>> Sent: Tuesday, May 24, 2016 5:27 PM

>> To: Stuti Awasthi

>> Cc: user

>> Subject: Re: Not able to write output to local filsystem from Standalone

>> mode.

>>

>> Hi,

>>

>> What happens when you create the parent directory /home/stuti? I think the

>> failure is due to missing parent directories. What's the OS?

>>

>> Jacek

>>

>> On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:

>>

>> Hi All,

>>

>> I have a 3-node Spark 1.6 Standalone mode cluster with 1 master and 2

>> slaves. Also, I am not using Hadoop as the filesystem. Now, I am able to launch the

>> shell, read the input file from the local filesystem, and perform transformations

>> successfully. When I try to write my output to a local filesystem path, I

>> receive the error below.

>>

>>

>>

>> I tried to search on the web and found a similar JIRA:

>> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows as

>> resolved for Spark 1.3+, people have posted that the same issue still

>> persists in the latest versions.

>>

>>

>>

>> ERROR

>>

>> scala> data.saveAsTextFile("/home/stuti/test1")

>>

>> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,

>> server1): java.io.IOException: The temporary job-output directory

>> file:/home/stuti/test1/_temporary doesn't exist!

>>

>> at

>> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)

>>

>> at

>> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)

>>

>> at

>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)

>>

>> at

>> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)

>>

>> at

>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)

>>

>> at

>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)

>>

>> at

>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

>>

>> at org.apache.spark.scheduler.Task.run(Task.scala:89)

>>

>> at

>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

>>

>> at

>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

>>

>> at

>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

>>

>> at java.lang.Thread.run(Thread.java:745)

>>

>>

>>

>> What is the best way to resolve this issue if I don't want to have

>> Hadoop installed, or is it mandatory to have Hadoop in order to write output from

>> Standalone cluster mode?

>>

>>

>>

>> Please suggest.

>>

>>

>>

>> Thanks 

>>

>> Stuti Awasthi

>>

>>

>>

>>

>>

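
A minimal sketch of the two usual workarounds (paths and URIs below are placeholders,
not from the original post): either point the output at storage every node can reach,
or, for small results only, bring the data back to the driver and write it there.

// Option 1: write to a filesystem shared by driver and executors (HDFS, NFS mount, S3, ...)
data.saveAsTextFile("hdfs://namenode:8020/user/stuti/test1")   // placeholder URI

// Option 2: small results only; collect to the driver and write locally there
import java.io.PrintWriter
val pw = new PrintWriter("/home/stuti/test1.txt")   // assumes data is an RDD[String]
data.collect().foreach(line => pw.println(line))
pw.close()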

Re: Pros and Cons

2016-05-26 Thread Koert Kuipers
We do disk-to-disk iterative algorithms in spark all the time, on datasets
that do not fit in memory, and it works well for us. I usually have to do
some tuning of number of partitions for a new dataset but that's about it
in terms of inconveniences.
On May 26, 2016 2:07 AM, "Jörn Franke"  wrote:


Spark can handle this, true, but it is optimized for the idea that it works
on the same full dataset in-memory due to the underlying nature of
machine learning algorithms (iterative). Of course, you can spill over, but
you should avoid that.

That being said you should have read my final sentence about this. Both
systems develop and change.


On 25 May 2016, at 22:14, Reynold Xin  wrote:


On Wed, May 25, 2016 at 9:52 AM, Jörn Franke  wrote:

> Spark is more for machine learning working iteravely over the whole same
> dataset in memory. Additionally it has streaming and graph processing
> capabilities that can be used together.
>

Hi Jörn,

The first part is actually not true. Spark can handle data far greater than
the aggregate memory available on a cluster. The more recent versions
(1.3+) of Spark have external operations for almost all built-in operators,
and while things may not be perfect, those external operators are becoming
more and more robust with each version of Spark.


Re: Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
Done, version 1.6.1 has the fix; I updated and it works fine.

Thanks.

On Thu, May 26, 2016 at 4:15 PM, Anthony May  wrote:

> It's on the 1.6 branch
>
> On Thu, May 26, 2016 at 4:43 PM Andrés Ivaldi  wrote:
>
>> I see, I'm using Spark 1.6.0 and that change seems to be for 2.0 or maybe
>> it's in 1.6.1 looking at the history.
>> thanks, I'll see if I can update Spark to 1.6.1
>>
>> On Thu, May 26, 2016 at 3:33 PM, Anthony May 
>> wrote:
>>
>>> It doesn't appear to be configurable, but it is inserting by column name:
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
>>>
>>> On Thu, 26 May 2016 at 16:02 Andrés Ivaldi  wrote:
>>>
 Hello,
  I realize that when a DataFrame executes an insert, it inserts by
  schema column order instead of by name, i.e.

 dataframe.write(SaveMode).jdbc(url, table, properties)

 Reading the profiler the execution is

 insert into TableName values(a,b,c..)

 what i need is
 insert into TableNames (colA,colB,colC) values(a,b,c)

 could be some configuration?

 regards.

 --
 Ing. Ivaldi Andres

>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres
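
For anyone stuck on an older version, a small workaround sketch (column names below are
placeholders): reorder the DataFrame's columns to match the target table before writing,
so the positional INSERT lines up.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Hypothetical column order of the target JDBC table.
val targetOrder = Seq("colA", "colB", "colC")

val aligned = dataframe.select(targetOrder.map(col): _*)
aligned.write.mode(SaveMode.Append).jdbc(url, table, properties)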


Re: Insert into JDBC

2016-05-26 Thread Anthony May
It's on the 1.6 branch
On Thu, May 26, 2016 at 4:43 PM Andrés Ivaldi  wrote:

> I see, I'm using Spark 1.6.0 and that change seems to be for 2.0 or maybe
> it's in 1.6.1 looking at the history.
> thanks, I'll see if I can update Spark to 1.6.1
>
> On Thu, May 26, 2016 at 3:33 PM, Anthony May  wrote:
>
>> It doesn't appear to be configurable, but it is inserting by column name:
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
>>
>> On Thu, 26 May 2016 at 16:02 Andrés Ivaldi  wrote:
>>
>>> Hello,
>>>  I realize that when a DataFrame executes an insert, it inserts by schema
>>> column order instead of by name, i.e.
>>>
>>> dataframe.write(SaveMode).jdbc(url, table, properties)
>>>
>>> Reading the profiler the execution is
>>>
>>> insert into TableName values(a,b,c..)
>>>
>>> what i need is
>>> insert into TableNames (colA,colB,colC) values(a,b,c)
>>>
>>> could be some configuration?
>>>
>>> regards.
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
I see, I'm using Spark 1.6.0 and that change seems to be for 2.0 or maybe
it's in 1.6.1 looking at the history.
thanks, I'll see if I can update Spark to 1.6.1

On Thu, May 26, 2016 at 3:33 PM, Anthony May  wrote:

> It doesn't appear to be configurable, but it is inserting by column name:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102
>
> On Thu, 26 May 2016 at 16:02 Andrés Ivaldi  wrote:
>
>> Hello,
>>  I realize that when a DataFrame executes an insert, it inserts by schema
>> column order instead of by name, i.e.
>>
>> dataframe.write(SaveMode).jdbc(url, table, properties)
>>
>> Reading the profiler the execution is
>>
>> insert into TableName values(a,b,c..)
>>
>> what i need is
>> insert into TableNames (colA,colB,colC) values(a,b,c)
>>
>> could be some configuration?
>>
>> regards.
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres


Re: Insert into JDBC

2016-05-26 Thread Anthony May
It doesn't appear to be configurable, but it is inserting by column name:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L102

On Thu, 26 May 2016 at 16:02 Andrés Ivaldi  wrote:

> Hello,
>  I realize that when a DataFrame executes an insert, it inserts by schema
> column order instead of by name, i.e.
>
> dataframe.write(SaveMode).jdbc(url, table, properties)
>
> Reading the profiler the execution is
>
> insert into TableName values(a,b,c..)
>
> what i need is
> insert into TableNames (colA,colB,colC) values(a,b,c)
>
> could be some configuration?
>
> regards.
>
> --
> Ing. Ivaldi Andres
>


Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Reynold Xin
It's probably a good idea to have the vertica dialect too, since it doesn't
seem like it'd be too difficult to maintain. It is not going to be as
performant as the native Vertica data source, but is going to be much
lighter weight.


On Thu, May 26, 2016 at 3:09 PM, Mohammed Guller 
wrote:

> Vertica also provides a Spark connector. It was not GA the last time I
> looked at it, but available on the Vertica community site. Have you tried
> using the Vertica Spark connector instead of the JDBC driver?
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Aaron Ilovici [mailto:ailov...@wayfair.com]
> *Sent:* Thursday, May 26, 2016 8:08 AM
> *To:* user@spark.apache.org; d...@spark.apache.org
> *Subject:* JDBC Dialect for saving DataFrame into Vertica Table
>
>
>
> I am attempting to write a DataFrame of Rows to Vertica via
> DataFrameWriter's jdbc function in the following manner:
>
>
>
> dataframe.write().mode(SaveMode.Append).jdbc(url, table, properties);
>
>
>
> This works when there are no NULL values in any of the Rows in my
> DataFrame. However, when there are NULL values, I get the following error:
>
>
>
> ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
>
> java.sql.SQLFeatureNotSupportedException: [Vertica][JDBC](10220) Driver
> not capable.
>
> at com.vertica.exceptions.ExceptionConverter.toSQLException(Unknown
> Source)
>
> at
> com.vertica.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown
> Source)
>
> at com.vertica.jdbc.common.SPreparedStatement.setNull(Unknown Source)
>
>
>
> This appears to be Spark's attempt to set a null value in a
> PreparedStatement, but Vertica does not understand the type upon executing
> the transaction. I see in JdbcDialects.scala that there are dialects for
> MySQL, Postgres, DB2, MsSQLServer, Derby, and Oracle.
>
>
>
> 1 - Would writing a dialect for Vertica alleviate the issue, by setting a
> 'NULL' in a type that Vertica would understand?
>
> 2 - What would be the best way to do this without a Spark patch? Scala,
> Java, make a jar and call 'JdbcDialects.registerDialect(VerticaDialect)'
> once created?
>
> 3 - Where would one find the proper mapping between Spark DataTypes and
> Vertica DataTypes? I don't see 'NULL' handling for any of the dialects,
> only the base case 'case _ => None' - is None mapped to the proper NULL
> type elsewhere?
>
>
>
> My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7
>
>
>
> I would be happy to create a Jira and submit a pull request with the
> VerticaDialect once I figure this out.
>
>
>
> Thank you for any insight on this,
>
>
>
> *AARON ILOVICI*
> Software Engineer
>
> Marketing Engineering
>
> *WAYFAIR*
> 4 Copley Place
> Boston, MA 02116
>
> (617) 532-6100 x1231
> ailov...@wayfair.com
>
>
>
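
For reference, a bare-bones sketch of what such a dialect could look like, registered
without patching Spark. The type mappings here are illustrative guesses, not a verified
Vertica mapping:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Minimal custom dialect: only overrides what the generic dialect gets wrong.
object VerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  // Illustrative mappings only; verify against Vertica's actual types.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(65000)", java.sql.Types.VARCHAR))
    case BooleanType => Some(JdbcType("BOOLEAN", java.sql.Types.BOOLEAN))
    case _           => None   // fall back to Spark's default mapping
  }
}

// Register once, before any DataFrameWriter.jdbc call that targets Vertica.
JdbcDialects.registerDialect(VerticaDialect)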


RE: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Mohammed Guller
Vertica also provides a Spark connector. It was not GA the last time I looked 
at it, but available on the Vertica community site. Have you tried using the 
Vertica Spark connector instead of the JDBC driver?

Mohammed
Author: Big Data Analytics with 
Spark

From: Aaron Ilovici [mailto:ailov...@wayfair.com]
Sent: Thursday, May 26, 2016 8:08 AM
To: user@spark.apache.org; d...@spark.apache.org
Subject: JDBC Dialect for saving DataFrame into Vertica Table

I am attempting to write a DataFrame of Rows to Vertica via DataFrameWriter's 
jdbc function in the following manner:

dataframe.write().mode(SaveMode.Append).jdbc(url, table, properties);

This works when there are no NULL values in any of the Rows in my DataFrame. 
However, when there are NULL values, I get the following error:

ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.sql.SQLFeatureNotSupportedException: [Vertica][JDBC](10220) Driver not 
capable.
at com.vertica.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown 
Source)
at com.vertica.jdbc.common.SPreparedStatement.setNull(Unknown Source)

This appears to be Spark's attempt to set a null value in a PreparedStatement, 
but Vertica does not understand the type upon executing the transaction. I see 
in JdbcDialects.scala that there are dialects for MySQL, Postgres, DB2, 
MsSQLServer, Derby, and Oracle.

1 - Would writing a dialect for Vertica alleviate the issue, by setting a 'NULL' 
in a type that Vertica would understand?
2 - What would be the best way to do this without a Spark patch? Scala, Java, 
make a jar and call 'JdbcDialects.registerDialect(VerticaDialect)' once created?
3 - Where would one find the proper mapping between Spark DataTypes and Vertica 
DataTypes? I don't see 'NULL' handling for any of the dialects, only the base 
case 'case _ => None' - is None mapped to the proper NULL type elsewhere?

My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7

I would be happy to create a Jira and submit a pull request with the 
VerticaDialect once I figure this out.

Thank you for any insight on this,

AARON ILOVICI
Software Engineer
Marketing Engineering


WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com




Insert into JDBC

2016-05-26 Thread Andrés Ivaldi
Hello,
 I realize that when a DataFrame executes an insert, it inserts by schema
column order instead of by name, i.e.

dataframe.write(SaveMode).jdbc(url, table, properties)

Reading the profiler the execution is

insert into TableName values(a,b,c..)

what i need is
insert into TableNames (colA,colB,colC) values(a,b,c)

could be some configuration?

regards.

-- 
Ing. Ivaldi Andres


Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Michael Armbrust
You can also just make sure that each user is using their own directory.  A
rough example can be found in TestHive.

Note: in Spark 2.0 there should be no need to use HiveContext unless you
need to talk to a metastore.
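
Something along these lines (untested sketch; the configuration keys are the standard
Hive/Derby ones and the paths are only placeholders):

// Sketch: give each user a private location for the Derby metastore and warehouse.
import org.apache.spark.sql.hive.HiveContext

val user = sys.props("user.name")
// Derby resolves the relative metastore_db path against derby.system.home.
System.setProperty("derby.system.home", s"/tmp/$user/derby")

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.warehouse.dir", s"/tmp/$user/hive-warehouse")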

On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh 
wrote:

> Well make sure than you set up a reasonable RDBMS as metastore. Ours is
> Oracle but you can get away with others. Check the supported list in
>
> hduser@rhes564:: :/usr/lib/hive/scripts/metastore/upgrade> ltr
> total 40
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 postgres
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mysql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 mssql
> drwxr-xr-x 2 hduser hadoop 4096 Feb 21 23:48 derby
> drwxr-xr-x 3 hduser hadoop 4096 May 20 18:44 oracle
>
> you have few good ones in the list.  In general the base tables (without
> transactional support) are around 55  (Hive 2) and don't take much space
> (depending on the volume of tables). I attached a E-R diagram.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 26 May 2016 at 19:09, Gerard Maas  wrote:
>
>> Thanks a lot for the advice!.
>>
>> I found out why the standalone hiveContext would not work:  it was trying
>> to deploy a derby db and the user had no rights to create the dir where
>> the db is stored:
>>
>> Caused by: java.sql.SQLException: Failed to create database
>> 'metastore_db', see the next exception for details.
>>
>>at
>> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
>> Source)
>>
>>at
>> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>> Source)
>>
>>... 129 more
>>
>> Caused by: java.sql.SQLException: Directory
>> /usr/share/spark-notebook/metastore_db cannot be created.
>>
>>
>> Now, the new issue is that we can't start more than 1 context at the same
>> time. I think we will need to setup a proper metastore.
>>
>>
>> -kind regards, Gerard.
>>
>>
>>
>>
>> On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> To use HiveContext witch is basically an sql api within Spark without
>>> proper hive set up does not make sense. It is a super set of Spark
>>> SQLContext
>>>
>>> In addition simple things like registerTempTable may not work.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 26 May 2016 at 13:01, Silvio Fiorito 
>>> wrote:
>>>
 Hi Gerard,



 I’ve never had an issue using the HiveContext without a hive-site.xml
 configured. However, one issue you may have is if multiple users are
 starting the HiveContext from the same path, they’ll all be trying to store
 the default Derby metastore in the same location. Also, if you want them to
 be able to persist permanent table metadata for SparkSQL then you’ll want
 to set up a true metastore.



 The other thing it could be is Hive dependency collisions from the
 classpath, but that shouldn’t be an issue since you said it’s standalone
 (not a Hadoop distro right?).



 Thanks,

 Silvio



 *From: *Gerard Maas 
 *Date: *Thursday, May 26, 2016 at 5:28 AM
 *To: *spark users 
 *Subject: *HiveContext standalone => without a Hive metastore



 Hi,



 I'm helping some folks setting up an analytics cluster with  Spark.

 They want to use the HiveContext to enable the Window functions on
 DataFrames(*) but they don't have any Hive installation, nor they need one
 at the moment (if not necessary for this feature)



 When we try to create a Hive context, we get the following error:



 > val sqlContext = new
 org.apache.spark.sql.hive.HiveContext(sparkContext)

 java.lang.RuntimeException: java.lang.RuntimeException: Unable to
 instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)



 Is my HiveContext failing b/c it wants to connect to an unconfigured
  Hive Metastore?



 Is there  a way to instantiate a HiveContext for the sake of Window
 support without an underlying Hive deployment?



 The docs are explicit in saying that that is should be the case: [1]



Spark input size when filtering on parquet files

2016-05-26 Thread Dennis Hunziker
Hi all



I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) in order to find
out about the improvements made in filtering/scanning parquet files when
querying for tables using SparkSQL and how these changes relate to the new
filter API introduced in Parquet 1.7.0.



After checking the usual sources, I still can’t make sense of some of the
numbers shown on the Spark UI. As an example, I’m looking at the collect
stage for a query that’s selecting a single row from a table containing 1
million numbers using a simple where clause (i.e. col1 = 50) and this
is what I see on the UI:



0 SUCCESS ... 2.4 MB (hadoop) / 0

1 SUCCESS ... 2.4 MB (hadoop) / 25

2 SUCCESS ... 2.4 MB (hadoop) / 0

3 SUCCESS ... 2.4 MB (hadoop) / 0



Based on the min/max statistics of each of the parquet parts, it makes
sense not to expect any records for 3 out of the 4, because the record I’m
looking for can only be in a single file. But why is the input size above
shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
whole stage? Isn't it just meant to read the metadata and ignore the
content of the file?



Regards,

Dennis


Re: How to set the degree of parallelism in Spark SQL?

2016-05-26 Thread Mich Talebzadeh
Also worth adding that in standalone mode there is only one executor per
spark-submit job.

In Standalone cluster mode Spark allocates resources based on cores. By
default, an application will grab all the cores in the cluster.

You only have one worker that lives within the driver JVM process that you
start when you start the application with spark-shell or spark-submit in
the host where the cluster manager is running.

The Driver node runs on the same host that the cluster manager is running.
The Driver requests the Cluster Manager for resources to run tasks.. That
worker is tasked to create the executor (in this case there is only one
executor) for the Driver. The Executor runs tasks for the Driver. Only one
executor can be allocated on each worker per application

thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 26 May 2016 at 18:45, Ian  wrote:

> The number of executors is set when you launch the shell or an application
> with /spark-submit/. It's controlled by the /num-executors/ parameter:
>
> https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
> .
>
> It is also important that cranking up the number may not cause your queries to
> run faster. If you set it to, let's say 200, but you only have 10 cores
> divided over 5 nodes, then you may not see a significant speed-up beyond
> 5-10 executors.
>
> You may want to check out Cloudera's tuning guide:
>
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996p27031.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
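
To make that concrete, a small sketch of the knobs usually involved (values are
illustrative only):

// spark-submit --num-executors 10 --executor-cores 2 --executor-memory 8g ...
// controls how many executors/cores the application gets on YARN; standalone
// mode uses spark.cores.max instead.

// For Spark SQL specifically, post-shuffle parallelism is a separate setting:
sqlContext.setConf("spark.sql.shuffle.partitions", "200")   // default 200; tune to total cores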


Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
Sounds like you'd better talk to Hortonworks then

On Thu, May 26, 2016 at 2:33 PM, Mail.com  wrote:
> Hi Cody,
>
> I used Hortonworks jars for Spark Streaming that enable getting messages
> from Kafka with Kerberos.
>
> Thanks,
> Pradeep
>
>
>> On May 26, 2016, at 11:04 AM, Cody Koeninger  wrote:
>>
>> I wouldn't expect kerberos to work with anything earlier than the beta
>> consumer for kafka 0.10
>>
>>> On Wed, May 25, 2016 at 9:41 PM, Mail.com  wrote:
>>> Hi All,
>>>
>>> I am connecting Spark 1.6 streaming  to Kafka 0.8.2 with Kerberos. I ran 
>>> spark streaming in debug mode, but do not see any log saying it connected 
>>> to Kafka or  topic etc. How could I enable that.
>>>
>>> My spark streaming job runs but no messages are fetched from the RDD. 
>>> Please suggest.
>>>
>>> Thanks,
>>> Pradeep
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>




Re: Kafka connection logs in Spark

2016-05-26 Thread Mail.com
Hi Cody,

I used Hortonworks jars for Spark Streaming that enable getting messages
from Kafka with Kerberos.

Thanks,
Pradeep


> On May 26, 2016, at 11:04 AM, Cody Koeninger  wrote:
> 
> I wouldn't expect kerberos to work with anything earlier than the beta
> consumer for kafka 0.10
> 
>> On Wed, May 25, 2016 at 9:41 PM, Mail.com  wrote:
>> Hi All,
>> 
>> I am connecting Spark 1.6 streaming  to Kafka 0.8.2 with Kerberos. I ran 
>> spark streaming in debug mode, but do not see any log saying it connected to 
>> Kafka or  topic etc. How could I enable that.
>> 
>> My spark streaming job runs but no messages are fetched from the RDD. Please 
>> suggest.
>> 
>> Thanks,
>> Pradeep
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: Problem instantiation of HiveContext

2016-05-26 Thread Ian
The exception indicates that Spark cannot invoke the method it's trying to
call, which is probably caused by a library missing. Do you have a Hive
configuration (hive-site.xml) or similar in your $SPARK_HOME/conf folder?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-instantiation-of-HiveContext-tp26999p27035.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Thanks a lot for the advice!.

I found out why the standalone hiveContext would not work:  it was trying
to deploy a derby db and the user had no rights to create the dir where
the db is stored:

Caused by: java.sql.SQLException: Failed to create database 'metastore_db',
see the next exception for details.

   at
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)

   at
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
Source)

   ... 129 more

Caused by: java.sql.SQLException: Directory
/usr/share/spark-notebook/metastore_db cannot be created.


Now, the new issue is that we can't start more than 1 context at the same
time. I think we will need to setup a proper metastore.


-kind regards, Gerard.




On Thu, May 26, 2016 at 3:06 PM, Mich Talebzadeh 
wrote:

> To use HiveContext witch is basically an sql api within Spark without
> proper hive set up does not make sense. It is a super set of Spark
> SQLContext
>
> In addition simple things like registerTempTable may not work.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 26 May 2016 at 13:01, Silvio Fiorito 
> wrote:
>
>> Hi Gerard,
>>
>>
>>
>> I’ve never had an issue using the HiveContext without a hive-site.xml
>> configured. However, one issue you may have is if multiple users are
>> starting the HiveContext from the same path, they’ll all be trying to store
>> the default Derby metastore in the same location. Also, if you want them to
>> be able to persist permanent table metadata for SparkSQL then you’ll want
>> to set up a true metastore.
>>
>>
>>
>> The other thing it could be is Hive dependency collisions from the
>> classpath, but that shouldn’t be an issue since you said it’s standalone
>> (not a Hadoop distro right?).
>>
>>
>>
>> Thanks,
>>
>> Silvio
>>
>>
>>
>> *From: *Gerard Maas 
>> *Date: *Thursday, May 26, 2016 at 5:28 AM
>> *To: *spark users 
>> *Subject: *HiveContext standalone => without a Hive metastore
>>
>>
>>
>> Hi,
>>
>>
>>
>> I'm helping some folks setting up an analytics cluster with  Spark.
>>
>> They want to use the HiveContext to enable the Window functions on
>> DataFrames(*) but they don't have any Hive installation, nor they need one
>> at the moment (if not necessary for this feature)
>>
>>
>>
>> When we try to create a Hive context, we get the following error:
>>
>>
>>
>> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
>>
>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>
>>at
>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>>
>>
>>
>> Is my HiveContext failing b/c it wants to connect to an unconfigured
>>  Hive Metastore?
>>
>>
>>
>> Is there  a way to instantiate a HiveContext for the sake of Window
>> support without an underlying Hive deployment?
>>
>>
>>
>> The docs are explicit in saying that that is should be the case: [1]
>>
>>
>>
>> "To use a HiveContext, you do not need to have an existing Hive setup,
>> and all of the data sources available to aSQLContext are still
>> available. HiveContext is only packaged separately to avoid including
>> all of Hive’s dependencies in the default Spark build."
>>
>>
>>
>> So what is the right way to address this issue? How to instantiate a
>> HiveContext with spark running on a HDFS cluster without Hive deployed?
>>
>>
>>
>>
>>
>> Thanks a lot!
>>
>>
>>
>> -Gerard.
>>
>>
>>
>> (*) The need for a HiveContext to use Window functions is pretty obscure.
>> The only documentation of this seems to be a runtime exception: 
>> "org.apache.spark.sql.AnalysisException:
>> Could not resolve window function 'max'. Note that, using window functions
>> currently requires a HiveContext;"
>>
>>
>>
>> [1]
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
>>
>
>


Re: Stackoverflowerror in scala.collection

2016-05-26 Thread Jeff Jones
I’ve seen this when I specified “too many” where clauses in the SQL query. I 
was able to adjust my query to use a single ‘in’ clause rather than many ‘=’ 
clauses but I realize that may not be an option in all cases.

Jeff
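
For example, something like this (sketch; df, id and allowedIds are made-up names):

import org.apache.spark.sql.functions.col

// Many OR'd equality predicates build a very deep expression tree:
// df.filter(col("id") === 1 || col("id") === 2 || /* ...hundreds more... */ col("id") === 999)

// A single IN predicate keeps the expression tree flat:
val allowedIds = Seq(1, 2, 999)
val filtered = df.filter(col("id").isin(allowedIds: _*))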

On 5/4/16, 2:04 PM, "BenD"  wrote:

>I am getting a java.lang.StackOverflowError somewhere in my program. I am not
>able to pinpoint which part causes it because the stack trace seems to be
>incomplete (see end of message). The error doesn't happen all the time, and
>I think it is based on the number of files that I load. I am running on AWS
>EMR with Spark 1.6.0 with an m1.xlarge as driver and 3 r8.xlarge (244GB ram
>+ 32 cores each) and an r2.xlarge (61GB ram + 8 cores) as executor machines
>with the following configuration:
>
>spark.driver.cores2
>spark.yarn.executor.memoryOverhead5000
>spark.dynamicAllocation.enabledtrue
>spark.executor.cores2
>spark.driver.memory14g
>spark.executor.memory12g
>
>While I can't post the full code or data, I will give a basic outline. I am
>loading many JSON files from S3 into a JavaRDD<String> which is then mapped
>to a JavaPairRDD<Long, String> where the Long is the timestamp of the file.
>I then union the RDDs into a single RDD which is then turned into a
>DataFrame. After I have this dataframe, I run an SQL query on it and then
>dump the result to S3.
>
>A cut down version of the code would look similar to this:
>
>List<JavaPairRDD<Long, String>> linesList = validFiles.map(x -> {
> try {
>   Long date = dateMapper.call(x);
>   return context.textFile(x.asPath()).mapToPair(y -> new
>Tuple2<>(date, y));
> } catch (Exception e) {
>  throw new RuntimeException(e);
> }
>}).collect(Collectors.toList());
>
>JavaPairRDD<Long, String> unionedRDD = linesList.get(0);
>if (linesList.size() > 1) {
>unionedRDD = context.union(unionedRDD , linesList.subList(1,
>linesList.size()));
>}
>
>HiveContext sqlContext = new HiveContext(context);
>DataFrame table = sqlContext.read().json(unionedRDD.values());
>table.registerTempTable("table");
>sqlContext.cacheTable("table");
>dumpToS3(sqlContext.sql("query"));
>
>
>This runs fine some times, but other times I get the
>java.lang.StackOverflowError. I know the error happens on a run where 7800
>files are loaded. Based on the error message mentioning mapped values, I
>assume the problem occurs in the mapToPair function, but I don't know why it
>happens. Does anyone have some insight into this problem?
>
>This is the whole print out of the error as seen in the container log:
>java.lang.StackOverflowError
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at
>scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>at

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Ian
Have you tried saveAsNewAPIHadoopFile?

See:
http://stackoverflow.com/questions/29238126/how-to-save-a-spark-rdd-to-an-avro-file



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-RDD-of-Avro-GenericRecord-as-parquet-throws-UnsupportedOperationException-tp27025p27034.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: List of questions about spark

2016-05-26 Thread Ian
I'll attempt to answer a few of your questions:

There are no limitations with regard to the number of dimension or lookup
tables for Spark. As long as you have disk space, you should have no
problem. Obviously, if you do joins among dozens or hundreds of tables it
may take a while since it's unlikely that you can cache all of the tables.
You may be able to cache (temporary) lookup tables, which means that joins
to the fact table(s) would be a lot faster.

This also means that for Spark there is no additional direct cost. You may
need more hardware because of the storage requirements and perhaps also more
RAM to be able to handle more cached tables and concurrency. With Spark you
can at least choose to persist tables in memory and spill to disk only when
necessary. MapReduce is 100% disk-based, for instance.

Windowing functions are supported either by HiveQL (i.e. via SQLContext.sql
or HiveContext.sql - in Spark 2.0 these will have the same entry point) or
via API functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
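
For instance, a window function expressed through the API looks roughly like this
(sketch with made-up column names):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// Running maximum of amount per customer, ordered by event time.
val w = Window.partitionBy("customer").orderBy(col("event_ts"))
val withMax = df.withColumn("running_max", max(col("amount")).over(w))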

In the API you'll also find other functions you're looking for. Moreover,
you can also check out the documentation for Hive because that's also
available. For instance, in 1.6 there is no equivalent to Hive's LATERAL
VIEW OUTER, but since in the sql() method you have access to that, it's not
a limitation. There just is no native method in the API.

Technically there are no limitations on joins, although for bigger tables
they will take longer. Caching really helps you out here. Nested queries are
no problem, you can always use SQLContext or HiveContext.sql, which gives
you a normal SQL interface.

Spark has APIs for Scala, Java, Python and R.

By the way, I assume you mean 'a billion rows'. Most of your questions are
answered on the official Spark pages, so please have a look there too.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/List-of-questios-about-spark-tp27027p27033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to set the degree of parallelism in Spark SQL?

2016-05-26 Thread Ian
The number of executors is set when you launch the shell or an application
with /spark-submit/. It's controlled by the /num-executors/ parameter:
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/.

It is also important that cranking up the number may not cause your queries to
run faster. If you set it to, let's say 200, but you only have 10 cores
divided over 5 nodes, then you may not see a significant speed-up beyond
5-10 executors.

You may want to check out Cloudera's tuning guide:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996p27031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Subtract two DataFrames is not working

2016-05-26 Thread Ted Yu
Can you be a bit more specific about how they didn't work ?

BTW 1.4.1 seems to be an old release.

Please try 1.6.1 if possible.

Cheers

On Thu, May 26, 2016 at 9:44 AM, Gurusamy Thirupathy 
wrote:

> I have to subtract two DataFrames. I tried the except method but it's not
> working. I tried drop as well. I am using Spark version 1.4.1 and Scala
> 2.10.
> Can you please help?
>
>
> Thanks,
> Guru
>
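
For what it's worth, a small sketch of the two common approaches (untested on 1.4.1;
dfA/dfB are placeholders and must have identical schemas):

// DataFrame-level set difference:
val onlyInA = dfA.except(dfB)

// RDD-level fallback if except() misbehaves on an older release:
val fallback = sqlContext.createDataFrame(dfA.rdd.subtract(dfB.rdd), dfA.schema)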


Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
that is nice and compact, but it does not add the columns to an existing
dataframe

On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> How about this?
> --
> val func = udf((i: Int) => Tuple2(i, i))
> val df = Seq((1, 0), (2, 5)).toDF("a", "b")
> df.select(func($"a").as("r")).select($"r._1", $"r._2")
>
> // maropu
>
>
> On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers  wrote:
>
>> hello all,
>>
>> i have a single udf that creates 2 outputs (so a tuple 2). i would like
>> to add these 2 columns to my dataframe.
>>
>> my current solution is along these lines:
>> df
>>   .withColumn("_temp_", udf(inputColumns))
>>   .withColumn("x", col("_temp_)("_1"))
>>   .withColumn("y", col("_temp_")("_2"))
>>   .drop("_temp_")
>>
>> this works, but its not pretty with the temporary field stuff.
>>
>> i also tried this:
>> val tmp = udf(inputColumns)
>> df
>>   .withColumn("x", tmp("_1"))
>>   .withColumn("y", tmp("_2"))
>>
>> this also works, but unfortunately the udf is evaluated twice
>>
>> is there a better way to do this?
>>
>> thanks! koert
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Subtract two DataFrames is not working

2016-05-26 Thread Gurusamy Thirupathy
I have to subtract two DataFrames. I tried the except method but it's not
working. I tried drop as well. I am using Spark version 1.4.1 and Scala
2.10.
Can you please help?


Thanks,
Guru


Distributed matrices with column counts represented by Int (rather than Long)

2016-05-26 Thread Phillip Henry
Hi,

I notice that some DistributedMatrix implementations represent the number of columns
with an Int rather than a Long (RowMatrix etc.). This limits the number of
columns to about 2 billion.

We're approaching that limit. What do people recommend we do to mitigate
the problem? Are there plans to use a larger data type as the trait
suggests it should be?

Regards,

Phillip


Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtWmyYB5fweR=Re+Best+way+to+store+Avro+Objects+as+Parquet+using+SPARK

On Thu, May 26, 2016 at 6:55 AM, Govindasamy, Nagarajan <
ngovindas...@turbine.com> wrote:

> Hi,
>
> I am trying to save RDD of Avro GenericRecord as parquet. I am using Spark
> 1.6.1.
>
>
> DStreamOfAvroGenericRecord.foreachRDD(rdd => 
> rdd.toDF().write.parquet("s3://bucket/data.parquet"))
>
> Getting the following exception. Is there a way to save Avro GenericRecord
> as Parquet or ORC file?
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *java.lang.UnsupportedOperationException: Schema for type
> org.apache.avro.generic.GenericRecord is not supported  at
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:690)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:689)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:689)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:642)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
> at
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
> at
> org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)*
>
> Thanks,
>
> Raj
>
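
One workaround sketch (untested): skip reflection over GenericRecord entirely and build
Rows against an explicit schema before writing. The field names and types below are
placeholders:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema matching the Avro records' fields.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

DStreamOfAvroGenericRecord.foreachRDD { rdd =>
  val rows = rdd.map(rec => Row(rec.get("id").asInstanceOf[Long], rec.get("name").toString))
  sqlContext.createDataFrame(rows, schema)
    .write.mode("append").parquet("s3://bucket/data.parquet")
}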


Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Thank you Cody, I will try to follow your advice.

Alonso Isidoro Roman
about.me/alonso.isidoro.roman


2016-05-26 17:00 GMT+02:00 Cody Koeninger :

> Honestly, given this thread and the Stack Overflow thread, I'd say you
> need to back up, start very simply, and learn Spark. If for some reason
> the official docs aren't doing it for you, Learning Spark from O'Reilly is a
> good book.
>
> Given your specific question, why not just
>
> messages.foreachRDD { rdd =>
>   rdd.foreachPartition { iterator =>
>     someWorkOnAnIterator(iterator)
>   }
> }
>
> All of the other extraneous stuff you're doing doesn't make any sense to
> me.
>
>
>
> On Thu, May 26, 2016 at 2:48 AM, Alonso Isidoro Roman 
> wrote:
>
>> Hi Matthias and Cody,
>>
>> You can see in the code that StreamingContext.start() is called after the
>> messages.foreachRDD output action. Another problem @Cody is how can i avoid
>> the inner .foreachRDD(_.foreachPartition(it =>
>> recommender.predictWithALS(it.toSeq))) in order to invoke asynchronously
>> recommender.predictWithALS which runs a machine learning ALS implementation
>> with a message from the kafka topic?.
>>
>> In the actual code i am not using for now any code to save data within
>> the mongo instance, for now, it is more important to be focus in how to
>> receive the message from the kafka topic and feeding asynchronously the ALS
>> implementation. Probably the Recommender object will need the code for
>>  interact with the mongo instance.
>>
>> The idea of the process is to receive data from the kafka topic,
>> calculate its recommendations based on the incoming message and save the
>> results within a mongo instance. Is it possible?  Am i missing something
>> important?
>>
>> def main(args: Array[String]) {
>> // Process program arguments and set properties
>>
>> if (args.length < 2) {
>>   System.err.println("Usage: " + this.getClass.getSimpleName + "
>>  ")
>>   System.exit(1)
>> }
>>
>> val Array(brokers, topics) = args
>>
>> println("Initializing Streaming Spark Context and kafka connector...")
>> // Create context with 2 second batch interval
>> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>>.setMaster("local[4]")
>>
>> .set("spark.driver.allowMultipleContexts", "true")
>>
>> val sc = new SparkContext(sparkConf)
>>
>> sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>> //this checkpointdir should be in a conf file, for now it is
>> hardcoded!
>> val streamingCheckpointDir =
>> "/Users/aironman/my-recommendation-spark-engine/checkpoint"
>> ssc.checkpoint(streamingCheckpointDir)
>>
>> // Create direct kafka stream with brokers and topics
>> val topicsSet = topics.split(",").toSet
>> val kafkaParams = Map[String, String]("metadata.broker.list" ->
>> brokers)
>> val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
>> println("Initialized Streaming Spark Context and kafka connector...")
>>
>> //create recomendation module
>> println("Creating rating recommender module...")
>> val ratingFile= "ratings.csv"
>> val recommender = new Recommender(sc,ratingFile)
>> println("Initialized rating recommender module...")
>>
>> //i have to convert messages which is a InputDStream into a
>> Seq[AmazonRating]...
>> try{
>> messages.foreachRDD( rdd =>{
>>   val count = rdd.count()
>>   if (count > 0){
>> //someMessages should be AmazonRating...
>> val someMessages = rdd.take(count.toInt)
>> println("<-->")
>> println("someMessages is " + someMessages)
>> someMessages.foreach(println)
>> println("<-->")
>> println("<---POSSIBLE SOLUTION--->")
>>
>> messages
>> .map { case (_, jsonRating) =>
>>   val jsValue = Json.parse(jsonRating)
>>   AmazonRating.amazonRatingFormat.reads(jsValue) match {
>> case JsSuccess(rating, _) => rating
>> case JsError(_) => AmazonRating.empty
>>   }
>>  }
>> .filter(_ != AmazonRating.empty)
>> //probably is not a good idea to do this...
>> .foreachRDD(_.foreachPartition(it =>
>> recommender.predictWithALS(it.toSeq)))
>>
>> println("<---POSSIBLE SOLUTION--->")
>>
>>   }
>>   }
>> )
>> }catch{
>>   case e: IllegalArgumentException => {println("illegal arg.
>> exception")};
>>   case e: IllegalStateException=> {println("illegal state
>> exception")};
>>   case e: ClassCastException   => {println("ClassCastException")};
>>   case e: Exception=> {println(" Generic 

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Cody Koeninger
Honestly given this thread, and the stack overflow thread, I'd say you need
to back up, start very simply, and learn spark.  If for some reason the
official docs aren't doing it for you, Learning Spark from O'Reilly is a
good book.

Given your specific question, why not just

messages.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    someWorkOnAnIterator(iterator)
  }
}


All of the other extraneous stuff you're doing doesn't make any sense to
me.



On Thu, May 26, 2016 at 2:48 AM, Alonso Isidoro Roman 
wrote:

> Hi Matthias and Cody,
>
> You can see in the code that StreamingContext.start() is called after the
> messages.foreachRDD output action. Another problem @Cody is how can i avoid
> the inner .foreachRDD(_.foreachPartition(it =>
> recommender.predictWithALS(it.toSeq))) in order to invoke asynchronously
> recommender.predictWithALS which runs a machine learning ALS implementation
> with a message from the kafka topic?.
>
> In the actual code i am not using for now any code to save data within the
> mongo instance, for now, it is more important to be focus in how to receive
> the message from the kafka topic and feeding asynchronously the ALS
> implementation. Probably the Recommender object will need the code for
>  interact with the mongo instance.
>
> The idea of the process is to receive data from the kafka topic, calculate
> its recommendations based on the incoming message and save the results
> within a mongo instance. Is it possible?  Am i missing something important?
>
> def main(args: Array[String]) {
> // Process program arguments and set properties
>
> if (args.length < 2) {
>   System.err.println("Usage: " + this.getClass.getSimpleName + "
>  ")
>   System.exit(1)
> }
>
> val Array(brokers, topics) = args
>
> println("Initializing Streaming Spark Context and kafka connector...")
> // Create context with 2 second batch interval
> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>.setMaster("local[4]")
>
> .set("spark.driver.allowMultipleContexts", "true")
>
> val sc = new SparkContext(sparkConf)
>
> sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> //this checkpointdir should be in a conf file, for now it is hardcoded!
> val streamingCheckpointDir =
> "/Users/aironman/my-recommendation-spark-engine/checkpoint"
> ssc.checkpoint(streamingCheckpointDir)
>
> // Create direct kafka stream with brokers and topics
> val topicsSet = topics.split(",").toSet
> val kafkaParams = Map[String, String]("metadata.broker.list" ->
> brokers)
> val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
> println("Initialized Streaming Spark Context and kafka connector...")
>
> //create recomendation module
> println("Creating rating recommender module...")
> val ratingFile= "ratings.csv"
> val recommender = new Recommender(sc,ratingFile)
> println("Initialized rating recommender module...")
>
> //i have to convert messages which is a InputDStream into a
> Seq[AmazonRating]...
> try{
> messages.foreachRDD( rdd =>{
>   val count = rdd.count()
>   if (count > 0){
> //someMessages should be AmazonRating...
> val someMessages = rdd.take(count.toInt)
> println("<-->")
> println("someMessages is " + someMessages)
> someMessages.foreach(println)
> println("<-->")
> println("<---POSSIBLE SOLUTION--->")
>
> messages
> .map { case (_, jsonRating) =>
>   val jsValue = Json.parse(jsonRating)
>   AmazonRating.amazonRatingFormat.reads(jsValue) match {
> case JsSuccess(rating, _) => rating
> case JsError(_) => AmazonRating.empty
>   }
>  }
> .filter(_ != AmazonRating.empty)
> //probably is not a good idea to do this...
> .foreachRDD(_.foreachPartition(it =>
> recommender.predictWithALS(it.toSeq)))
>
> println("<---POSSIBLE SOLUTION--->")
>
>   }
>   }
> )
> }catch{
>   case e: IllegalArgumentException => {println("illegal arg.
> exception")};
>   case e: IllegalStateException=> {println("illegal state
> exception")};
>   case e: ClassCastException   => {println("ClassCastException")};
>   case e: Exception=> {println(" Generic Exception")};
> }finally{
>
>   println("Finished taking data from kafka topic...")
> }
>
> //println("jsonParsed is " + jsonParsed)
> //The idea is to save results from Recommender.predict within mongodb,
> so i will have to deal with this issue
> //after resolving the issue of
> .foreachRDD(_.foreachPartition(recommender.predict(_.toSeq)))
>
> *ssc.start()*
> ssc.awaitTermination()
>
> 
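For reference, a minimal, self-contained sketch of the structure Cody describes above, using the Spark 1.x direct-stream API. The application name, broker address and topic name are placeholder assumptions, and the per-record work is left as a stub to be replaced with the real parsing/prediction logic:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch").setMaster("local[4]")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // Broker and topic names are placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("ratingsTopic"))

    // A single output action; the per-partition work runs on the executors.
    messages.foreachRDD { rdd =>
      rdd.foreachPartition { iterator =>
        iterator.foreach { case (_, value) =>
          println(value) // replace with the real per-record work
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}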

JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Aaron Ilovici
I am attempting to write a DataFrame of Rows to Vertica via DataFrameWriter's 
jdbc function in the following manner:

dataframe.write().mode(SaveMode.Append).jdbc(url, table, properties);

This works when there are no NULL values in any of the Rows in my DataFrame. 
However, when there are rows, I get the following error:

ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.sql.SQLFeatureNotSupportedException: [Vertica][JDBC](10220) Driver not 
capable.
at com.vertica.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown 
Source)
at com.vertica.jdbc.common.SPreparedStatement.setNull(Unknown Source)

This appears to be Spark's attempt to set a null value in a PreparedStatement, 
but Vertica does not understand the type upon executing the transaction. I see 
in JdbcDialects.scala that there are dialects for MySQL, Postgres, DB2, 
MsSQLServer, Derby, and Oracle.

1 - Would writing a dialect for Vertica alleviate the issue, by setting a 'NULL' 
in a type that Vertica would understand?
2 - What would be the best way to do this without a Spark patch? Scala, Java, 
make a jar and call 'JdbcDialects.registerDialect(VerticaDialect)' once created?
3 - Where would one find the proper mapping between Spark DataTypes and Vertica 
DataTypes? I don't see 'NULL' handling for any of the dialects, only the base 
case 'case _ => None' - is None mapped to the proper NULL type elsewhere?

My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7
I would be happy to create a Jira and submit a pull request with the 
VerticaDialect once I figure this out.

Thank you for any insight on this,

AARON ILOVICI
Software Engineer
Marketing Engineering


WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com
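Regarding question 2, a minimal sketch of registering a custom dialect from application code, without patching Spark, using the JdbcDialect API that ships with Spark 1.6. The Vertica type definitions in the mapping are illustrative assumptions, not a verified Vertica mapping:

import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Sketch of a Vertica dialect; extend the type mapping as needed.
case object VerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(65000)", Types.VARCHAR)) // assumed Vertica type
    case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))        // assumed Vertica type
    case _           => None // fall back to Spark's default mapping
  }
}

// Register once on the driver, before calling dataframe.write().jdbc(...):
JdbcDialects.registerDialect(VerticaDialect)

Whether this resolves the setNull "Driver not capable" error depends on which JDBC type codes the Vertica driver accepts, so treat the mapping above only as a starting point.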




Re: Kafka connection logs in Spark

2016-05-26 Thread Cody Koeninger
I wouldn't expect kerberos to work with anything earlier than the beta
consumer for kafka 0.10

On Wed, May 25, 2016 at 9:41 PM, Mail.com  wrote:
> Hi All,
>
> I am connecting Spark 1.6 streaming  to Kafka 0.8.2 with Kerberos. I ran 
> spark streaming in debug mode, but do not see any log saying it connected to 
> Kafka or  topic etc. How could I enable that.
>
> My spark streaming job runs but no messages are fetched from the RDD. Please 
> suggest.
>
> Thanks,
> Pradeep
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Job Execution halts during shuffle...

2016-05-26 Thread Priya Ch
Hello Team,


 I am trying to join 2 RDDs, where one is 800 MB and the other is 190 MB.
During the join step, my job halts and I don't see any progress in the
execution.

This is the message I see on console -

INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
locations for shuffle 0 to :4
INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
locations for shuffle 1 to :4

After these messages, I don't see any progress. I am using Spark 1.6.0 and
the YARN scheduler (running in YARN client mode). My cluster configuration
is a 3-node cluster (1 master and 2 slaves). Each slave has 1 TB of hard
disk space, 300 GB of memory and 32 cores.

HDFS block size is 128 MB.

Thanks,
Padma Ch


save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Govindasamy, Nagarajan
Hi,

I am trying to save RDD of Avro GenericRecord as parquet. I am using Spark 
1.6.1.

DStreamOfAvroGenericRecord.foreachRDD(rdd => 
rdd.toDF().write.parquet("s3://bucket/data.parquet"))

Getting the following exception. Is there a way to save Avro GenericRecord as 
Parquet or ORC file?

java.lang.UnsupportedOperationException: Schema for type 
org.apache.avro.generic.GenericRecord is not supported
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:690)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:689)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:689)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:642)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at 
org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)

Thanks,

Raj
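One workaround sketch, offered as an assumption rather than a known fix: since Spark's ScalaReflection cannot derive a schema from the generic GenericRecord interface, map each record to Rows with an explicit StructType before writing. The field names and types below are illustrative placeholders for the real Avro schema:

import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Illustrative schema; replace with the fields of the actual Avro schema.
val sparkSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

def toRow(record: GenericRecord): Row = Row(
  record.get("id").asInstanceOf[Long],
  Option(record.get("name")).map(_.toString).orNull)

DStreamOfAvroGenericRecord.foreachRDD { rdd: RDD[GenericRecord] =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  sqlContext.createDataFrame(rdd.map(toRow), sparkSchema)
    .write.mode("append").parquet("s3://bucket/data.parquet")
}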


Apache Spark Video Processing from NFS Shared storage: Advice needed

2016-05-26 Thread mobcdi
Hi all,

Is it advisable to use nfs as shared storage for a small Spark cluster to
process video and images? I have a total of 20 vms (2vCPU, 6GB Ram, 20GB
Local Disk) connected to 500GB nfs shared storage (mounted the same in each
of the vms) at my disposal and I'm wondering if I can avoid the need for
hdfs and instead use the larger capacity nfs to work with my videos and
images in Spark?

I have spun up a master node (using maven, not sbt) and connected 1 slave to
it, but I haven't made any configuration changes to Spark. On
masternode:4040/ I don't see anything under Storage. Is that to be expected?
And if I do need to spin up Hadoop, can I double up the 20 VMs by running
both Hadoop and Spark on all 20 machines, or would the recommendation be to
split them into separate Hadoop and Spark clusters?

Michael



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Video-Processing-from-NFS-Shared-storage-Advise-needed-tp27030.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Mich Talebzadeh
Using HiveContext, which is basically a SQL API within Spark, without a
proper Hive setup does not make sense. It is a superset of Spark's
SQLContext.

In addition, simple things like registerTempTable may not work.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 26 May 2016 at 13:01, Silvio Fiorito 
wrote:

> Hi Gerard,
>
>
>
> I’ve never had an issue using the HiveContext without a hive-site.xml
> configured. However, one issue you may have is if multiple users are
> starting the HiveContext from the same path, they’ll all be trying to store
> the default Derby metastore in the same location. Also, if you want them to
> be able to persist permanent table metadata for SparkSQL then you’ll want
> to set up a true metastore.
>
>
>
> The other thing it could be is Hive dependency collisions from the
> classpath, but that shouldn’t be an issue since you said it’s standalone
> (not a Hadoop distro right?).
>
>
>
> Thanks,
>
> Silvio
>
>
>
> *From: *Gerard Maas 
> *Date: *Thursday, May 26, 2016 at 5:28 AM
> *To: *spark users 
> *Subject: *HiveContext standalone => without a Hive metastore
>
>
>
> Hi,
>
>
>
> I'm helping some folks setting up an analytics cluster with  Spark.
>
> They want to use the HiveContext to enable the Window functions on
> DataFrames(*) but they don't have any Hive installation, nor they need one
> at the moment (if not necessary for this feature)
>
>
>
> When we try to create a Hive context, we get the following error:
>
>
>
> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
>
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>
>at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>
>
>
> Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive
> Metastore?
>
>
>
> Is there  a way to instantiate a HiveContext for the sake of Window
> support without an underlying Hive deployment?
>
>
>
> The docs are explicit in saying that that is should be the case: [1]
>
>
>
> "To use a HiveContext, you do not need to have an existing Hive setup,
> and all of the data sources available to aSQLContext are still available.
> HiveContext is only packaged separately to avoid including all of Hive’s
> dependencies in the default Spark build."
>
>
>
> So what is the right way to address this issue? How to instantiate a
> HiveContext with spark running on a HDFS cluster without Hive deployed?
>
>
>
>
>
> Thanks a lot!
>
>
>
> -Gerard.
>
>
>
> (*) The need for a HiveContext to use Window functions is pretty obscure.
> The only documentation of this seems to be a runtime exception: 
> "org.apache.spark.sql.AnalysisException:
> Could not resolve window function 'max'. Note that, using window functions
> currently requires a HiveContext;"
>
>
>
> [1]
> http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
>


System.exit in local mode ?

2016-05-26 Thread yael aharon
Hello,
I have noticed that in
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala
spark would call System.exit if an uncaught exception was encountered.
I have an application that is running spark in local mode, and would like
to avoid exiting the application if that happens.
Will spark exit my application in local mode too, or is that the behavior
only in cluster mode? Is there a setting to override this behavior?
thanks, Yael


Re: Error while saving plots

2016-05-26 Thread Sonal Goyal
Does the path /home/njoshi/dev/outputs/test_/plots/  exist on the driver ?

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Wed, May 25, 2016 at 2:07 AM, njoshi  wrote:

> For an analysis app, I have to make ROC curves on the fly and save to
> disc. I
> am using scala-chart for this purpose and doing the following in my Spark
> app:
>
>
> val rocs = performances.map{case (id, (auRoc, roc)) => (id,
> roc.collect().toList)}
> XYLineChart(rocs.toSeq, title = "Pooled Data Performance:
> AuROC").saveAsPNG(outputpath + "/plots/global.png")
>
>
> However, I am getting the following exception. Does anyone have idea of the
> cause?
>
>
> Exception in thread "main" java.io.FileNotFoundException:
> file:/home/njoshi/dev/outputs/test_/plots/global.png (No such file or
> directory)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:213)
> at java.io.FileOutputStream.(FileOutputStream.java:101)
> at
>
> scalax.chart.exporting.PNGExporter$.saveAsPNG$extension(exporting.scala:138)
> at
>
> com.aol.advertising.ml.globaldata.EvaluatorDriver$.main(EvaluatorDriver.scala:313)
> at
>
> com.aol.advertising.ml.globaldata.EvaluatorDriver.main(EvaluatorDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> Thanks in advance,
>
> Nikhil
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-saving-plots-tp27016.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
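One possible cause, offered only as an assumption since just the stack trace is visible: the outputpath carries a "file:" URI scheme, which java.io.FileOutputStream treats as part of a literal (and non-existent) path. A small sketch that normalizes the path and creates the plots directory before saving; the helper name is hypothetical:

import java.net.URI
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: strip a leading file: scheme and make sure the
// plots/ directory exists before the chart library opens the PNG.
def preparePlotTarget(outputpath: String): String = {
  val base: Path =
    if (outputpath.startsWith("file:")) Paths.get(new URI(outputpath))
    else Paths.get(outputpath)
  val target = base.resolve("plots").resolve("global.png")
  Files.createDirectories(target.getParent)
  target.toString
}

// XYLineChart(rocs.toSeq, title = "Pooled Data Performance: AuROC")
//   .saveAsPNG(preparePlotTarget(outputpath))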


Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Silvio Fiorito
Hi Gerard,

I’ve never had an issue using the HiveContext without a hive-site.xml 
configured. However, one issue you may have is if multiple users are starting 
the HiveContext from the same path, they’ll all be trying to store the default 
Derby metastore in the same location. Also, if you want them to be able to 
persist permanent table metadata for SparkSQL then you’ll want to set up a true 
metastore.

The other thing it could be is Hive dependency collisions from the classpath, 
but that shouldn’t be an issue since you said it’s standalone (not a Hadoop 
distro right?).

Thanks,
Silvio

From: Gerard Maas 
Date: Thursday, May 26, 2016 at 5:28 AM
To: spark users 
Subject: HiveContext standalone => without a Hive metastore

Hi,

I'm helping some folks setting up an analytics cluster with  Spark.
They want to use the HiveContext to enable the Window functions on 
DataFrames(*) but they don't have any Hive installation, nor they need one at 
the moment (if not necessary for this feature)

When we try to create a Hive context, we get the following error:

> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
   at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive 
Metastore?

Is there  a way to instantiate a HiveContext for the sake of Window support 
without an underlying Hive deployment?

The docs are explicit in saying that that is should be the case: [1]

"To use a HiveContext, you do not need to have an existing Hive setup, and all 
of the data sources available to a SQLContext are still available. HiveContext 
is only packaged separately to avoid including all of Hive’s dependencies in 
the default Spark build."

So what is the right way to address this issue? How to instantiate a 
HiveContext with spark running on a HDFS cluster without Hive deployed?


Thanks a lot!

-Gerard.

(*) The need for a HiveContext to use Window functions is pretty obscure. The 
only documentation of this seems to be a runtime exception: 
"org.apache.spark.sql.AnalysisException: Could not resolve window function 
'max'. Note that, using window functions currently requires a HiveContext;"

[1] 
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started


Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Mich Talebzadeh
Hi Gerard,

I am not sure that this kind of independence will work. I gather you want to
use HiveContext for your SQL queries, and SQLContext only provides a subset
of HiveContext.

try this

  val sc = new SparkContext(conf)
 // Create sqlContext based on HiveContext
 val sqlContext = new HiveContext(sc)


However, it will take 3 minutes to set up Hive, and all you need is to add a
softlink from $SPARK_HOME/conf to hive-site.xml:

hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

The fact that it is not working suggests that the statement in the docs may
not be valid.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 26 May 2016 at 10:28, Gerard Maas  wrote:

> Hi,
>
> I'm helping some folks setting up an analytics cluster with  Spark.
> They want to use the HiveContext to enable the Window functions on
> DataFrames(*) but they don't have any Hive installation, nor they need one
> at the moment (if not necessary for this feature)
>
> When we try to create a Hive context, we get the following error:
>
> > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
>
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>
>at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>
> Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive
> Metastore?
>
> Is there  a way to instantiate a HiveContext for the sake of Window
> support without an underlying Hive deployment?
>
> The docs are explicit in saying that that is should be the case: [1]
>
> "To use a HiveContext, you do not need to have an existing Hive setup,
> and all of the data sources available to aSQLContext are still available.
> HiveContext is only packaged separately to avoid including all of Hive’s
> dependencies in the default Spark build."
>
> So what is the right way to address this issue? How to instantiate a
> HiveContext with spark running on a HDFS cluster without Hive deployed?
>
>
> Thanks a lot!
>
> -Gerard.
>
> (*) The need for a HiveContext to use Window functions is pretty obscure.
> The only documentation of this seems to be a runtime exception: "
> org.apache.spark.sql.AnalysisException: Could not resolve window function
> 'max'. Note that, using window functions currently requires a HiveContext;"
>
>
> [1]
> http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
>


Re: How spark depends on Guava

2016-05-26 Thread Steve Loughran

On 23 May 2016, at 06:32, Todd > wrote:


Can someone please take a look at my question? I am using spark-shell in local
mode and yarn-client mode. Spark code uses the guava library, so spark should
have guava in place at run time.

Thanks.



At 2016-05-23 11:48:58, "Todd" > wrote:
Hi,
In the spark code, the guava maven dependency scope is "provided"; my question
is, how does spark depend on guava at runtime? I looked into
spark-assembly-1.6.1-hadoop2.6.1.jar and didn't find class entries like
com.google.common.base.Preconditions etc...

Spark "shades" guava when building the assembly: the libraries are moved to a
new package and all internal references are rewritten to match.

This is because guava is a nightmare of backwards compatibility.

if you want guava, decide which version you want and ask for it explicitly
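For example, a minimal build.sbt sketch of declaring Guava explicitly in an application, rather than relying on the relocated copy inside the Spark assembly; the Guava version shown is an illustrative assumption:

// build.sbt (sketch): depend on Spark as "provided" and pick your own Guava.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "com.google.guava" %  "guava"      % "16.0.1"
)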


Re: spark on yarn

2016-05-26 Thread Steve Loughran

> On 21 May 2016, at 15:14, Shushant Arora  wrote:
> 
> And will it allocate the rest of the executors when other containers, which 
> were occupied by other hadoop jobs/spark applications, get freed?
> 

requests will go into the queue(s), they'll stay outstanding until things free 
up *or more machines join the cluster*. Whoever is in the higher priority queue 
gets that free capacity

you can also play with pre-emption, in which low priority work can get killed 
without warning

> And is there any minimum (% of executors demanded vs available) it waits for 
> to be freed, or does it just start with even 1?
> 


that's called "gang scheduling", and no, it's not in YARN. It's a tricky one, as 
it can complicate allocation and can result in either things never getting 
scheduled, or >1 app having incompletely allocated containers, so that while 
the capacity is enough for one app, with the resources split across both, 
neither can start.

look at YARN-896 to see the big todo list for services
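
On the Spark side, a small sketch of the knobs that control how many executors an application asks YARN for and how long it waits for them before it starts scheduling tasks; the values are illustrative assumptions:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "10")                            // executors requested from YARN
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")        // start once 80% have registered
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")  // ...or after this timeout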

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Nikolay Voronchikhin
Hi Jörn,

We will be upgrading to MapR 5.1, Hive 1.2, and Spark 1.6.1 at the end of
June.

In the meantime, can this still be done with these versions?
There is no firewall issue, since the edge nodes and cluster nodes are hosted
in the same location with the same NFS mount.



On Thu, May 26, 2016 at 1:34 AM, Jörn Franke  wrote:

> Both have outdated versions, usually one can support you better if you
> upgrade to the newest.
> Firewall could be an issue here.
>
>
> On 26 May 2016, at 10:11, Nikolay Voronchikhin 
> wrote:
>
> Hi PySpark users,
>
> We need to be able to run large Hive queries in PySpark 1.2.1. Users are
> running PySpark on an Edge Node, and submit jobs to a Cluster that
> allocates YARN resources to the clients.
> We are using MapR as the Hadoop Distribution on top of Hive 0.13 and Spark
> 1.2.1.
>
>
> Currently, our process for writing queries works only for small result
> sets, for example:
> *from pyspark.sql import HiveContext*
> *sqlContext = HiveContext(sc)*
> *results = sqlContext.sql("select column from database.table limit
> 10").collect()*
> *results*
> 
>
>
> How do I save the HiveQL query to RDD first, then output the results?
>
> This is the error I get when running a query that requires output of
> 400,000 rows:
> *from pyspark.sql import HiveContext*
> *sqlContext = HiveContext(sc)*
> *results = sqlContext.sql("select column from database.table").collect()*
> *results*
> ...
>
> /path/to/mapr/spark/spark-1.2.1/python/pyspark/sql.py in collect(self)   1976 
> """   1977 with SCCallSiteSync(self.context) as css:-> 1978   
>   bytesInJava = 
> self._jschema_rdd.baseSchemaRDD().collectToPython().iterator()   1979 
> cls = _create_cls(self.schema())   1980 return map(cls, 
> self._collect_iterator_through_file(bytesInJava))
> /path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)536 answer = 
> self.gateway_client.send_command(command)537 return_value = 
> get_return_value(answer, self.gateway_client,--> 538 
> self.target_id, self.name)539 540 for temp_arg in temp_args:
> /path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)298  
>raise Py4JJavaError(299 'An error occurred 
> while calling {0}{1}{2}.\n'.--> 300 format(target_id, 
> '.', name), value)301 else:302 raise 
> Py4JError(
> Py4JJavaError: An error occurred while calling o76.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Failed to connect 
> to cluster_node/IP_address:port
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>
>
> For this example, ideally, this query should output the 400,000 row

HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Hi,

I'm helping some folks setting up an analytics cluster with  Spark.
They want to use the HiveContext to enable the Window functions on
DataFrames(*) but they don't have any Hive installation, nor they need one
at the moment (if not necessary for this feature)

When we try to create a Hive context, we get the following error:

> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

   at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

Is my HiveContext failing b/c it wants to connect to an unconfigured  Hive
Metastore?

Is there  a way to instantiate a HiveContext for the sake of Window support
without an underlying Hive deployment?

The docs are explicit in saying that that is should be the case: [1]

"To use a HiveContext, you do not need to have an existing Hive setup, and
all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive’s
dependencies in the default Spark build."

So what is the right way to address this issue? How to instantiate a
HiveContext with spark running on a HDFS cluster without Hive deployed?


Thanks a lot!

-Gerard.

(*) The need for a HiveContext to use Window functions is pretty obscure.
The only documentation of this seems to be a runtime exception: "
org.apache.spark.sql.AnalysisException: Could not resolve window function
'max'. Note that, using window functions currently requires a HiveContext;"


[1]
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
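
For what it's worth, a minimal sketch of the pattern being asked about (Spark 1.x): with no hive-site.xml on the classpath, HiveContext falls back to an embedded local Derby metastore created under the working directory, which is enough to unlock the window functions. The toy data below is an illustrative assumption:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc is the existing SparkContext
import hiveContext.implicits._

val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("k", "v")
val byKey = Window.partitionBy($"k")
df.select($"k", $"v", max($"v").over(byKey).as("max_v")).show()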


Does decimal(6,-2) exists on purpose?

2016-05-26 Thread Ofir Manor
Hi,
I was surprised to notice a negative scale on decimal (Spark 1.6.1). To
reproduce:

scala> z.printSchema
root
 |-- price: decimal(6,2) (nullable = true)

scala> val a = z.selectExpr("round(price,-2)")
a: org.apache.spark.sql.DataFrame = [round(price,-2): decimal(6,-2)]


I expected the function to return decimal(6,0)
It doesn't immediately break anything for me, but I'm not performing
additional numeric manipulation on the results.

BTW - thinking about it, both round(price) and round(price,-2) might better
return decimal(4,0), not (6,0).
The input decimal was .nn and will become just .

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
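
As a small workaround sketch while the scale question stands, the result can be cast back to an explicit non-negative scale (reusing the z DataFrame from the reproduction above):

val rounded = z.selectExpr("cast(round(price, -2) as decimal(6,0)) as price_rounded")
rounded.printSchema()
// root
//  |-- price_rounded: decimal(6,0) (nullable = true)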


How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Nikolay Voronchikhin
Hi PySpark users,

We need to be able to run large Hive queries in PySpark 1.2.1. Users are
running PySpark on an Edge Node, and submit jobs to a Cluster that
allocates YARN resources to the clients.
We are using MapR as the Hadoop Distribution on top of Hive 0.13 and Spark
1.2.1.


Currently, our process for writing queries works only for small result
sets, for example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("select column from database.table limit 10").collect()
results



How do I save the HiveQL query to RDD first, then output the results?

This is the error I get when running a query that requires output of
400,000 rows:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("select column from database.table").collect()
results
...

/path/to/mapr/spark/spark-1.2.1/python/pyspark/sql.py in collect(self)
   1976         """
   1977         with SCCallSiteSync(self.context) as css:
-> 1978             bytesInJava = self._jschema_rdd.baseSchemaRDD().collectToPython().iterator()
   1979         cls = _create_cls(self.schema())
   1980         return map(cls, self._collect_iterator_through_file(bytesInJava))

/path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result: java.io.IOException: Failed to
connect to cluster_node/IP_address:port
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




For this example, ideally, this query should output the 400,000 row
resultset.


Thanks for your help,
Nikolay Voronchikhin
https://www.linkedin.com/in/nvoronchikhin

E-mail: nvoronchik...@gmail.com


Re: Facing issues while reading parquet file in spark 1.2.1

2016-05-26 Thread vaibhav srivastava
Any suggestions?
On 25 May 2016 17:25, "vaibhav srivastava"  wrote:

> Hi,
> I am using spark 1.2.1. when I am trying to read a parquet file using SQL
> context.parquetFile("path to file") . The parquet file is using
> parquethiveserde and input format is mapredParquetInputFormat.
>
> Thanks
> Vaibhav.
> On 25 May 2016 17:03, "Takeshi Yamamuro"  wrote:
>
>> Hi,
>>
>> You need to describe more to make others easily understood;
>> what's the version of spark and what's the query you use?
>>
>> // maropu
>>
>>
>> On Wed, May 25, 2016 at 8:27 PM, vaibhav srivastava <
>> vaibhavcs...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>  I am facing below stack traces while reading data from parquet file
>>>
>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
>>>
>>> at parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:247)
>>>
>>> at
>>> parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:47)
>>>
>>> at
>>> parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
>>>
>>> at
>>> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
>>>
>>> at
>>> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
>>>
>>> at
>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
>>>
>>> at
>>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:389)
>>>
>>> at
>>> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>>>
>>> at
>>> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:457)
>>>
>>> at scala.Option.map(Option.scala:145)
>>>
>>> at
>>> org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:457)
>>>
>>> at
>>> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
>>>
>>> at
>>> org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65)
>>>
>>> at
>>> org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165)
>>>
>>> Please suggest. It seems like it is not able to convert some data
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-26 Thread Alonso Isidoro Roman
Hi Matthias and Cody,

You can see in the code that StreamingContext.start() is called after the
messages.foreachRDD output action. Another problem, @Cody, is how I can avoid
the inner .foreachRDD(_.foreachPartition(it =>
recommender.predictWithALS(it.toSeq))) so that recommender.predictWithALS,
which runs a machine learning ALS implementation, is invoked asynchronously
with a message from the kafka topic.

In the actual code I am not yet using any code to save data within the
mongo instance; for now, it is more important to focus on how to receive
the message from the kafka topic and feed the ALS implementation
asynchronously. Probably the Recommender object will need the code to
interact with the mongo instance.

The idea of the process is to receive data from the kafka topic, calculate
its recommendations based on the incoming message and save the results
within a mongo instance. Is it possible? Am I missing something important?

def main(args: Array[String]) {
// Process program arguments and set properties

if (args.length < 2) {
  System.err.println("Usage: " + this.getClass.getSimpleName + "
 ")
  System.exit(1)
}

val Array(brokers, topics) = args

println("Initializing Streaming Spark Context and kafka connector...")
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
   .setMaster("local[4]")

.set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(sparkConf)

sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
val ssc = new StreamingContext(sparkConf, Seconds(2))
//this checkpointdir should be in a conf file, for now it is hardcoded!
val streamingCheckpointDir =
"/Users/aironman/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
println("Initialized Streaming Spark Context and kafka connector...")

//create recomendation module
println("Creating rating recommender module...")
val ratingFile= "ratings.csv"
val recommender = new Recommender(sc,ratingFile)
println("Initialized rating recommender module...")

//i have to convert messages which is a InputDStream into a
Seq[AmazonRating]...
try{
messages.foreachRDD( rdd =>{
  val count = rdd.count()
  if (count > 0){
//someMessages should be AmazonRating...
val someMessages = rdd.take(count.toInt)
println("<-->")
println("someMessages is " + someMessages)
someMessages.foreach(println)
println("<-->")
println("<---POSSIBLE SOLUTION--->")

messages
.map { case (_, jsonRating) =>
  val jsValue = Json.parse(jsonRating)
  AmazonRating.amazonRatingFormat.reads(jsValue) match {
case JsSuccess(rating, _) => rating
case JsError(_) => AmazonRating.empty
  }
 }
.filter(_ != AmazonRating.empty)
//probably is not a good idea to do this...
.foreachRDD(_.foreachPartition(it =>
recommender.predictWithALS(it.toSeq)))

println("<---POSSIBLE SOLUTION--->")

  }
  }
)
}catch{
  case e: IllegalArgumentException => {println("illegal arg.
exception")};
  case e: IllegalStateException=> {println("illegal state
exception")};
  case e: ClassCastException   => {println("ClassCastException")};
  case e: Exception=> {println(" Generic Exception")};
}finally{

  println("Finished taking data from kafka topic...")
}

//println("jsonParsed is " + jsonParsed)
//The idea is to save results from Recommender.predict within mongodb,
so i will have to deal with this issue
//after resolving the issue of
.foreachRDD(_.foreachPartition(recommender.predict(_.toSeq)))

*ssc.start()*
ssc.awaitTermination()

println("Finished!")
  }
}

Thank you for reading this far; please, I need your assistance.

Regards


Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2016-05-25 17:33 GMT+02:00 Alonso Isidoro Roman :

> Hi Matthias and Cody, thanks for the answer. This is the code that is
> raising the runtime exception:
>
> try{
> messages.foreachRDD( rdd =>{
>   val count = rdd.count()
>   if (count > 0){
> //someMessages should be AmazonRating...
> val someMessages = rdd.take(count.toInt)
> println("<-->")
> println("someMessages is " + someMessages)
> 

Re: Pros and Cons

2016-05-26 Thread Jörn Franke

Spark can handle this, true, but it is optimized for working on the same full 
dataset in memory, due to the iterative nature of the underlying machine 
learning algorithms. Of course, you can spill over, but you should avoid that.

That being said, you should have read my final sentence about this. Both systems 
develop and change.


> On 25 May 2016, at 22:14, Reynold Xin  wrote:
> 
> 
>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke  wrote:
>> Spark is more for machine learning, working iteratively over the whole same 
>> dataset in memory. Additionally it has streaming and graph processing 
>> capabilities that can be used together. 
> 
> Hi Jörn,
> 
> The first part is actually not true. Spark can handle data far greater than 
> the aggregate memory available on a cluster. The more recent versions (1.3+) 
> of Spark have external operations for almost all built-in operators, and 
> while things may not be perfect, those external operators are becoming more 
> and more robust with each version of Spark.
> 
> 
> 
> 
>