Re: RFC: Remote "HBaseTest" from examples?

2016-08-18 Thread Ignacio Zendejas
I'm very late to this party, and I get the hbase-spark recommendation... but what's
the recommendation for PySpark + HBase? I realize this isn't necessarily a
concern of the Spark project, but it would be nice to at least document it here
with a short and sweet answer, because I haven't found anything
useful in the wild besides the approach in the examples with
PythonConverters, which were dropped in 2.0.

Thanks.

On Thu, Apr 21, 2016 at 1:47 PM, Ted Yu  wrote:

> Zhan:
> I mentioned the JIRA numbers in the thread starting with (note the
> typo in the subject of this thread):
>
> RFC: Remove ...
>
> On Thu, Apr 21, 2016 at 1:28 PM, Zhan Zhang 
> wrote:
>
>> FYI: There are several pending patches for DataFrame support on top of
>> HBase.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Apr 20, 2016, at 2:43 AM, Saisai Shao  wrote:
>>
>> +1. HBaseTest in the Spark examples is quite old and obsolete, while the HBase
>> connector in the HBase repo has evolved a lot; it would be better to point users
>> there rather than to the Spark example. So it's good to remove it.
>>
>> Thanks
>> Saisai
>>
>> On Wed, Apr 20, 2016 at 1:41 AM, Josh Rosen 
>> wrote:
>>
>>> +1; I think that it's preferable for code examples, especially
>>> third-party integration examples, to live outside of Spark.
>>>
>>> On Tue, Apr 19, 2016 at 10:29 AM Reynold Xin 
>>> wrote:
>>>
 Yeah, in general I feel that examples that bring in a large number of
 dependencies should live outside Spark.


 On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
 wrote:

> Hey all,
>
> Two reasons why I think we should remove that from the examples:
>
> - HBase now has Spark integration in its own repo, so that should really
> be the template for how to use HBase from Spark, which makes this
> example less useful, even misleading.
>
> - It brings in a lot of extra dependencies that make the Spark
> distribution grow in size.
>
> Any reason why we shouldn't drop that example?
>
> --
> Marcelo
>


Parquet partitioning / appends

2016-08-18 Thread Jeremy Smith
Hi,

I'm running into an issue wherein Spark (both 1.6.1 and 2.0.0) will fail
with a "GC overhead limit exceeded" error when creating a DataFrame from a
parquet-backed, partitioned Hive table with a relatively large number of
parquet files (~175 partitions, each containing many parquet files). If I then
use Hive directly to create a new table from the partitioned table with
CREATE TABLE AS, Hive completes with no problem, and Spark then has no
problem reading the resulting table.

Part of the problem is that whenever we insert records into a parquet table,
a new parquet file is created; for a streaming job this results in many small
parquet files. Since HDFS supports file appends, couldn't the records be
appended to the existing parquet file as a new row group? If I understand
correctly, this would be pretty straightforward - append the new data pages
and then write a copy of the existing footer with the new row groups included.
It wouldn't be as optimal as creating a whole new parquet file containing all
the data, but it would be much better than creating many small files (for many
different reasons, including the crash case above). And I'm sure I can't be
the only one struggling with streaming output to parquet.

I know the typical solution to this is to periodically compact the small
files into larger files, but it seems like parquet ought to be appendable
as-is - which would obviate the need for that.
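
For reference, the compaction workaround I'm describing is roughly the sketch
below (Spark 2.0 API; the paths and the target file count of 16 are just
placeholders):

    // Rewrite one partition's many small parquet files into a few larger ones.
    // Paths are placeholders; tune coalesce() to the actual data volume.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    spark.read.parquet("/warehouse/events/dt=2016-08-18")   // many small files
      .coalesce(16)                                         // fewer, larger output files
      .write
      .mode("overwrite")
      .parquet("/warehouse/events_compacted/dt=2016-08-18")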

Here's a partial trace of the error for reference:
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.io.ObjectStreamClass.getClassDataLayout0(ObjectStreamClass.java:1251)
    at java.io.ObjectStreamClass.getClassDataLayout(ObjectStreamClass.java:1195)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1885)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)

Thanks,
Jeremy


Early Draft Structured Streaming Machine Learning

2016-08-18 Thread Holden Karau
Hi Everyone (that cares about structured streaming and ML),

Seth and I have been giving some thought to supporting structured streaming in
machine learning - we've put together an early design doc (it's been in JIRA
(SPARK-16424) for a while, but in case you missed it) and we'd love your comments.

Also, if anyone happens to be in San Francisco this week and would like to
chat over coffee or bubble tea (I know some people on the list already get
boba tea as a perk, so in that case we can get pastries), feel free to reach
out to me on or off-list (
https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing
) and we will update the list with the result of any such discussion.

Cheers,

Holden :)

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Setting YARN executors' JAVA_HOME

2016-08-18 Thread Ryan Williams
Ah, I guess I missed that by only looking in the YARN config docs, but this
is a more general parameter and not documented there. Thanks!

On Thu, Aug 18, 2016 at 2:51 PM dhruve ashar  wrote:

> Hi Ryan,
>
> You can get more info on this here: Spark documentation.
>
> The page addresses what you need. You can look for
> spark.executorEnv.[EnvironmentVariableName] and set your java home as
> spark.executorEnv.JAVA_HOME=
>
> Regards,
> Dhruve
>
> On Thu, Aug 18, 2016 at 12:49 PM, Ryan Williams <
> ryan.blake.willi...@gmail.com> wrote:
>
>> I need to tell YARN a JAVA_HOME to use when spawning containers (to run a
>> Java 8 app on Java 7 YARN).
>>
>> The only way I've found that works is
>> setting SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java8".
>>
>> The code implies that this is deprecated and users should use "the config",
>> but I can't figure out what config is being referenced.
>>
>> Passing "--conf spark.yarn.appMasterEnv.JAVA_HOME=/path/to/java8" seems
>> to set it for the AM but not for executors.
>>
>> Likewise, spark.executor.extraLibraryPath and
>> spark.driver.extraLibraryPath don't appear to set JAVA_HOME (and maybe
>> aren't even supposed to?).
>>
>> The 1.0.1 docs are the last ones to reference the SPARK_YARN_USER_ENV var, afaict.
>>
>> What's the preferred way of passing YARN a custom JAVA_HOME that will be
>> applied to executors' containers?
>>
>> Thanks!
>>
>
>
>
> --
> -Dhruve Ashar
>
>


Re: Setting YARN executors' JAVA_HOME

2016-08-18 Thread dhruve ashar
Hi Ryan,

You can get more info on this here: Spark documentation.

The page addresses what you need. You can look for
spark.executorEnv.[EnvironmentVariableName] and set your java home as
spark.executorEnv.JAVA_HOME=
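
For example, something like the sketch below covers both the AM and the
executors ("/path/to/java8" is just a placeholder for wherever the JDK 8
install lives on your cluster nodes):

    // Sketch only: point both the YARN ApplicationMaster and the executors at
    // a different JAVA_HOME; master/appName are assumed to come from spark-submit.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.yarn.appMasterEnv.JAVA_HOME", "/path/to/java8") // AM container
      .set("spark.executorEnv.JAVA_HOME", "/path/to/java8")       // executor containers
    val sc = new SparkContext(conf)

    // In yarn-cluster mode the driver runs inside the AM, so pass the
    // appMasterEnv setting as a --conf flag to spark-submit instead of
    // setting it in code.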

Regards,
Dhruve

On Thu, Aug 18, 2016 at 12:49 PM, Ryan Williams <
ryan.blake.willi...@gmail.com> wrote:

> I need to tell YARN a JAVA_HOME to use when spawning containers (to run a
> Java 8 app on Java 7 YARN).
>
> The only way I've found that works is setting
> SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java8".
>
> The code implies that this is deprecated and users should use "the config",
> but I can't figure out what config is being referenced.
>
> Passing "--conf spark.yarn.appMasterEnv.JAVA_HOME=/path/to/java8" seems
> to set it for the AM but not for executors.
>
> Likewise, spark.executor.extraLibraryPath and
> spark.driver.extraLibraryPath don't appear to set JAVA_HOME (and maybe
> aren't even supposed to?).
>
> The 1.0.1 docs are the last ones to reference the SPARK_YARN_USER_ENV var, afaict.
>
> What's the preferred way of passing YARN a custom JAVA_HOME that will be
> applied to executors' containers?
>
> Thanks!
>



-- 
-Dhruve Ashar


Setting YARN executors' JAVA_HOME

2016-08-18 Thread Ryan Williams
I need to tell YARN a JAVA_HOME to use when spawning containers (to run a
Java 8 app on Java 7 YARN).

The only way I've found that works is
setting SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java8".

The code implies that this is deprecated and users should use "the config", but I
can't figure out what config is being referenced.

Passing "--conf spark.yarn.appMasterEnv.JAVA_HOME=/path/to/java8" seems to
set it for the AM but not for executors.

Likewise, spark.executor.extraLibraryPath and spark.driver.extraLibraryPath
don't appear to set JAVA_HOME (and maybe aren't even supposed to?).

The 1.0.1 docs are the last ones to reference the SPARK_YARN_USER_ENV var, afaict.

What's the preferred way of passing YARN a custom JAVA_HOME that will be
applied to executors' containers?

Thanks!


Re: How to convert spark data-frame to datasets?

2016-08-18 Thread Oscar Batori
From the docs, DataFrame is just Dataset[Row]. There are various converters for
subtypes of Product if you want, using "as[T]", where T <: Product or where
there is an implicit Encoder in scope, I believe.
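
Roughly, something like this (a sketch against Spark 2.0; the Person class and
the input path are just placeholders):

    // Convert a DataFrame (i.e. Dataset[Row]) into a typed Dataset[Person].
    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._                 // Encoders for case classes / Products

    val df = spark.read.json("people.json")  // a DataFrame, i.e. Dataset[Row]
    val ds: Dataset[Person] = df.as[Person]  // typed view via as[T]

    // Java equivalent (for the original question): use a bean class and an
    // explicit Encoder instead of the implicits:
    //   Dataset<Person> ds = df.as(Encoders.bean(Person.class));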

Also, this is probably a user list question.


On Thu, Aug 18, 2016 at 10:59 AM Minudika Malshan 
wrote:

> Hi all,
>
> Most Spark ML algorithms require a Dataset to train the model.
> I would like to know how to convert a Spark *data-frame* to a *dataset*
> using Java.
> Your support is much appreciated.
>
> Thank you!
> Minudika
>


How to convert spark data-frame to datasets?

2016-08-18 Thread Minudika Malshan
Hi all,

Most Spark ML algorithms require a Dataset to train the model.
I would like to know how to convert a Spark *data-frame* to a *dataset*
using Java.
Your support is much appreciated.

Thank you!
Minudika