Re: Appropriate Apache Users List Uses

2016-02-09 Thread Ryan Victory
Yeah, a little disappointed with this, I wouldn't expect to be sent
unsolicited mail based on my membership to this list.

-Ryan Victory

On Tue, Feb 9, 2016 at 1:36 PM, John Omernik  wrote:

> All, I received this today, is this appropriate list use? Note: This was
> unsolicited.
>
> Thanks
> John
>
>
>
> From: Pierce Lamb 
> 11:57 AM (1 hour ago)
> to me
>
> Hi John,
>
> I saw you on the Spark Mailing List and noticed you worked for * and
> wanted to reach out. My company, SnappyData, just launched an open source
> OLTP + OLAP Database built on Spark. Our lead investor is Pivotal, whose
> largest owner is EMC which makes * like a father figure :)
>
> SnappyData’s goal is two fold: Operationalize Spark and deliver truly
> interactive queries. To do this, we first integrated Spark with an
> in-memory database with a pedigree of production customer deployments:
> GemFireXD (GemXD).
>
> GemXD operationalized Spark via:
>
> -- True high availability
>
> -- A highly concurrent environment
>
> -- An OLTP engine that can process transactions (mutable state)
>
> With GemXD as a storage engine, we packaged SnappyData with Approximate
> Query Processing (AQP) technology. AQP enables interactive response times
> even when data volumes are huge because it allows the developer to trade
> latency for accuracy. AQP queries (SQL queries with a specified error rate)
> execute on sample tables -- tables that have taken a stratified sample of
> the full dataset. As such, AQP queries enable much faster decisions when
> 100% accuracy isn’t needed and sample tables require far fewer resources to
> manage.
>
> If that sounds interesting to you, please check out our Github repo (our
> release is hosted there under “releases”):
>
> https://github.com/SnappyDataInc/snappydata
>
> We also have a technical paper that dives into the architecture:
> http://www.snappydata.io/snappy-industrial
>
> Are you currently using Spark at ? I’d love to set up a call with you
> and hear about how you’re using it and see if SnappyData could be a fit.
>
> In addition to replying to this email, there are many ways to chat with
> us: https://github.com/SnappyDataInc/snappydata#community-support
>
> Hope to hear from you,
>
> Pierce
>
> pl...@snappydata.io
>
> http://www.twitter.com/snappydata
>


Appropriate Apache Users List Uses

2016-02-09 Thread John Omernik
All, I received this today, is this appropriate list use? Note: This was
unsolicited.

Thanks
John



From: Pierce Lamb 
11:57 AM (1 hour ago)
to me

Hi John,

I saw you on the Spark Mailing List and noticed you worked for * and
wanted to reach out. My company, SnappyData, just launched an open source
OLTP + OLAP Database built on Spark. Our lead investor is Pivotal, whose
largest owner is EMC which makes * like a father figure :)

SnappyData’s goal is two fold: Operationalize Spark and deliver truly
interactive queries. To do this, we first integrated Spark with an
in-memory database with a pedigree of production customer deployments:
GemFireXD (GemXD).

GemXD operationalized Spark via:

-- True high availability

-- A highly concurrent environment

-- An OLTP engine that can process transactions (mutable state)

With GemXD as a storage engine, we packaged SnappyData with Approximate
Query Processing (AQP) technology. AQP enables interactive response times
even when data volumes are huge because it allows the developer to trade
latency for accuracy. AQP queries (SQL queries with a specified error rate)
execute on sample tables -- tables that have taken a stratified sample of
the full dataset. As such, AQP queries enable much faster decisions when
100% accuracy isn’t needed and sample tables require far fewer resources to
manage.

If that sounds interesting to you, please check out our Github repo (our
release is hosted there under “releases”):

https://github.com/SnappyDataInc/snappydata

We also have a technical paper that dives into the architecture:
http://www.snappydata.io/snappy-industrial

Are you currently using Spark at ? I’d love to set up a call with you
and hear about how you’re using it and see if SnappyData could be a fit.

In addition to replying to this email, there are many ways to chat with us:
https://github.com/SnappyDataInc/snappydata#community-support

Hope to hear from you,

Pierce

pl...@snappydata.io

http://www.twitter.com/snappydata


Re: Spark with .NET

2016-02-09 Thread Arko Provo Mukherjee
Doesn't seem to be supported, but thanks! I will probably write some .NET
wrapper in my front end and use the java api in the backend.
Warm regards
Arko


On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu  wrote:

> This thread is related:
> http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+
>
> On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee <
> arkoprovomukher...@gmail.com> wrote:
>
>> Hello,
>>
>> I want to use Spark (preferable Spark SQL) using C#. Anyone has any
>> pointers to that?
>>
>> Thanks & regards
>> Arko
>>
>>
>


Spark with .NET

2016-02-09 Thread Arko Provo Mukherjee
Hello,

I want to use Spark (preferably Spark SQL) from C#. Does anyone have any
pointers on that?

Thanks & regards
Arko


spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
has anyone successfully connected to hive metastore using spark 1.6.0? i am
having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
launching spark with yarn.

this is what i see in logs:
16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore with
URI thrift://metastore.mycompany.com:9083
16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.

and then a little later:

16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
version 1.2.1
16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
2.6.0-cdh5.4.4
16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
hive.server2.enable.impersonation does not exist
16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize called
16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
datanucleus.cache.level2 unknown - will be ignored
16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
present in CLASSPATH (or one of dependencies)
16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
present in CLASSPATH (or one of dependencies)
16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
hive.server2.enable.impersonation does not exist
16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object pin
classes with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
underlying DB is DERBY
16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
  at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
  at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
  at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
  at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
  at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
  at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
  at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
  at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
  at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:97)
  at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.repl.Main$.createSQLContext(Main.scala:89)
  ... 47 elided
Caused by: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
  at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
  at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
  at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
  at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
  at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
  at 

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
hey thanks. hive-site is on classpath in conf directory

i currently got it to work by changing this hive setting in hive-site.xml:
hive.metastore.schema.verification=true
to
hive.metastore.schema.verification=false

this feels like a hack, because schema verification is a good thing i would
assume?

On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev  wrote:

> Hi Koert,
>
> As far as I can see you are using derby:
>
>  Using direct SQL, underlying DB is DERBY
>
> not mysql, which is used for the metastore. That means, spark couldn't
> find hive-site.xml on your classpath. Can you check that, please?
>
> Thanks, Alex.
>
> On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  wrote:
>
>> has anyone successfully connected to hive metastore using spark 1.6.0? i
>> am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
>> launching spark with yarn.
>>
>> this is what i see in logs:
>> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore
>> with URI thrift://metastore.mycompany.com:9083
>> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>>
>> and then a little later:
>>
>> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
>> version 1.2.1
>> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
>> 2.6.0-cdh5.4.4
>> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
>> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
>> hive.server2.enable.impersonation does not exist
>> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store with
>> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
>> called
>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>> datanucleus.cache.level2 unknown - will be ignored
>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>> present in CLASSPATH (or one of dependencies)
>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>> present in CLASSPATH (or one of dependencies)
>> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
>> hive.server2.enable.impersonation does not exist
>> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object
>> pin classes with
>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
>> underlying DB is DERBY
>> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>   at
>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>>   at
>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
>>   at
>> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
>>   at
>> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
>>   at
>> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
>>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
>>   at
>> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>>   at
>> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:97)
>>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>   at
>> 

Re: Appropriate Apache Users List Uses

2016-02-09 Thread u...@moosheimer.com
I wouldn't expect this either.
Very disappointing...

-Kay-Uwe Moosheimer

> Am 09.02.2016 um 20:53 schrieb Ryan Victory :
> 
> Yeah, a little disappointed with this, I wouldn't expect to be sent 
> unsolicited mail based on my membership to this list.
> 
> -Ryan Victory
> 
>> On Tue, Feb 9, 2016 at 1:36 PM, John Omernik  wrote:
>> All, I received this today, is this appropriate list use? Note: This was 
>> unsolicited. 
>> 
>> Thanks
>> John
>> 
>> 
>> 
>> From: Pierce Lamb 
>> 
>> 11:57 AM (1 hour ago)
>> 
>> 
>> 
>> to me
>> 
>> Hi John,
>> 
>> I saw you on the Spark Mailing List and noticed you worked for * and 
>> wanted to reach out. My company, SnappyData, just launched an open source 
>> OLTP + OLAP Database built on Spark. Our lead investor is Pivotal, whose 
>> largest owner is EMC which makes * like a father figure :)
>> 
>> SnappyData’s goal is two fold: Operationalize Spark and deliver truly 
>> interactive queries. To do this, we first integrated Spark with an in-memory 
>> database with a pedigree of production customer deployments: GemFireXD 
>> (GemXD).
>> 
>> GemXD operationalized Spark via:
>> -- True high availability
>> -- A highly concurrent environment
>> -- An OLTP engine that can process transactions (mutable state)
>> 
>> With GemXD as a storage engine, we packaged SnappyData with Approximate 
>> Query Processing (AQP) technology. AQP enables interactive response times 
>> even when data volumes are huge because it allows the developer to trade 
>> latency for accuracy. AQP queries (SQL queries with a specified error rate) 
>> execute on sample tables -- tables that have taken a stratified sample of 
>> the full dataset. As such, AQP queries enable much faster decisions when 
>> 100% accuracy isn’t needed and sample tables require far fewer resources to 
>> manage.
>> 
>> If that sounds interesting to you, please check out our Github repo (our 
>> release is hosted there under “releases”):
>> https://github.com/SnappyDataInc/snappydata
>> 
>> We also have a technical paper that dives into the architecture: 
>> http://www.snappydata.io/snappy-industrial
>> 
>> Are you currently using Spark at ? I’d love to set up a call with you 
>> and hear about how you’re using it and see if SnappyData could be a fit.
>> 
>> In addition to replying to this email, there are many ways to chat with us: 
>> https://github.com/SnappyDataInc/snappydata#community-support
>> 
>> Hope to hear from you,
>> 
>> Pierce
>> pl...@snappydata.io
>> http://www.twitter.com/snappydata
> 


Re: Spark with .NET

2016-02-09 Thread Bryan Jeffrey
Arko,

Check this out: https://github.com/Microsoft/SparkCLR

This is a Microsoft authored C# language binding for Spark.

Regards,

Bryan Jeffrey

On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee <
arkoprovomukher...@gmail.com> wrote:

> Doesn't seem to be supported, but thanks! I will probably write some .NET
> wrapper in my front end and use the java api in the backend.
> Warm regards
> Arko
>
>
> On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu  wrote:
>
>> This thread is related:
>> http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+
>>
>> On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee <
>> arkoprovomukher...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I want to use Spark (preferable Spark SQL) using C#. Anyone has any
>>> pointers to that?
>>>
>>> Thanks & regards
>>> Arko
>>>
>>>
>>
>


Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
Hi Koert,

As far as I can see you are using derby:

 Using direct SQL, underlying DB is DERBY

not mysql, which is used for the metastore. That means, spark couldn't find
hive-site.xml on your classpath. Can you check that, please?

Thanks, Alex.

On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  wrote:

> has anyone successfully connected to hive metastore using spark 1.6.0? i
> am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
> launching spark with yarn.
>
> this is what i see in logs:
> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore with
> URI thrift://metastore.mycompany.com:9083
> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>
> and then a little later:
>
> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
> version 1.2.1
> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
> 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
> called
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
> datanucleus.cache.level2 unknown - will be ignored
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
> present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
> present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object pin
> classes with
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
> underlying DB is DERBY
> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
>   at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
>   at
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
>   at
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
>   at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:97)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.spark.repl.Main$.createSQLContext(Main.scala:89)
>   ... 47 elided
> Caused by: java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at
> 

Re: Spark with .NET

2016-02-09 Thread Ted Yu
This thread is related:
http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+

On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee <
arkoprovomukher...@gmail.com> wrote:

> Hello,
>
> I want to use Spark (preferable Spark SQL) using C#. Anyone has any
> pointers to that?
>
> Thanks & regards
> Arko
>
>


Re: Bad Digest error while doing aws s3 put

2016-02-09 Thread Steve Loughran

> On 9 Feb 2016, at 07:19, lmk  wrote:
> 
> Hi Dhimant,
> As I had indicated in my next mail, my problem was due to disk getting full
> with log messages (these were dumped into the slaves) and did not have
> anything to do with the content pushed into s3. So, looks like this error
> message is very generic and is thrown for various reasons. You may probably
> have to do some more research to find out the cause of your problem..
> Please keep me posted once you fix this issue. Sorry, I could not be of much
> help to you..
> 
> Regards
> 

that's fun.

s3n/s3a buffer their output until close() is called, then they do a full upload

this breaks every assumption people have about file IO:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html

-especially the bits in the code about close() being fast and harmless; now it's 
O(data) and bad news if it fails.

If your close() was failing due to lack of HDD space, it means that your tmp 
dir and log dir were on the same disk/volume, and that ran out of capacity

HADOOP-11183 added an output variant which buffers in memory, primarily for 
faster output to rack-local storage supporting the s3 protocol. This is in ASF 
Hadoop 2.7, recent HDP and CDH releases. 

I don't know if it's in amazon EMR, because they have their own closed source 
EMR client (believed to be a modified ASF one with some special hooks to 
unstable s3 APIs)

Anyway: I would run, not walk, to using s3a on Hadoop 2.7+, as it's already 
better than s3n and getting better with every release
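
For reference, switching a job's output to s3a with the HADOOP-11183 in-memory 
upload enabled looks roughly like this (a sketch only; the property names are the 
standard s3a ones, while the credentials, bucket and RDD are placeholders):

// assumes a spark-shell with `sc` defined and hadoop-aws 2.7+ on the classpath
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder
sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")            // HADOOP-11183: buffer in memory
val rdd = sc.parallelize(1 to 100).map(_.toString)
rdd.saveAsTextFile("s3a://my-bucket/output")                        // the upload still happens on close()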

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ALS rating caching

2016-02-09 Thread Roberto Pagliari
Hi Nick,
From which version does that apply? I'm using 1.5.2

Thank you,


From: Nick Pentreath
Date: Tuesday, 9 February 2016 07:02
To: "user@spark.apache.org"
Subject: Re: ALS rating caching

In the "new" ALS intermediate RDDs (including the ratings input RDD after 
transforming to block-partitioned ratings) is cached using 
intermediateRDDStorageLevel, and you can select the final RDD storage level 
(for user and item factors) using finalRDDStorageLevel.

The old MLLIB API now calls the new ALS so the same semantics apply.

So it should not be necessary to cache the raw input RDD.

On Tue, 9 Feb 2016 at 01:48 Roberto Pagliari wrote:
When using ALS from mllib, would it be better/recommended to cache the ratings 
RDD?

I'm asking because when predicting products for users (for example) it is 
recommended to cache product/user matrices.

Thank you,
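
For reference, the storage-level knobs Nick describes above look roughly like this 
on the MLlib ALS builder (a sketch; Spark 1.5.x API assumed, and the rank, iteration 
count, storage levels and input path are arbitrary placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// ratings parsed from a placeholder CSV of "user,product,rating" lines;
// no explicit .cache() on this raw input RDD
val ratings = sc.textFile("hdfs:///path/ratings.csv").map { line =>
  val Array(u, p, r) = line.split(",")
  Rating(u.toInt, p.toInt, r.toDouble)
}

val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)  // intermediate RDDs
  .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)         // user/item factor RDDs
  .run(ratings)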



spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Alexandr Dzhagriev
Hello all,

I looked through the cassandra spark integration (
https://github.com/datastax/spark-cassandra-connector) and couldn't find
any usages of the BulkOutputWriter (
http://www.datastax.com/dev/blog/bulk-loading) - an awesome tool for
creating local sstables, which could be later uploaded to a cassandra
cluster.  Seems like (sorry if I'm wrong), it uses just bulk insert
statements. So, my question is: does anybody know if there are any plans to
support bulk loading?

Cheers, Alex.


[Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard

All,

I'm new to Spark and I'm having a hard time doing a simple join of two DFs

Intent:
-  I'm receiving data from Kafka via direct stream and would like to  
enrich the messages with data from Cassandra. The Kafka messages  
(Protobufs) are decoded into DataFrames and then joined with a  
(supposedly pre-filtered) DF from Cassandra. The relation of (Kafka)  
streaming batch size to raw C* data is [several streaming messages to  
millions of C* rows], BUT the join always yields exactly ONE result  
[1:1] per message. After the join the resulting DF is eventually  
stored to another C* table.


Problem:
- Even though I'm joining the two DFs on the full Cassandra primary  
key and pushing the corresponding filter to C*, it seems that Spark is  
loading the whole C* data-set into memory before actually joining  
(which I'd like to prevent by using the filter/predicate pushdown).  
This leads to a lot of shuffling and tasks being spawned, hence the  
"simple" join takes forever...


Could anyone shed some light on this? In my perception this should be  
a prime-example for DFs and Spark Streaming.


Environment:
- Spark 1.6
- Cassandra 2.1.12
- Cassandra-Spark-Connector 1.5-RC1
- Kafka 0.8.2.2

Code:

def main(args: Array[String]) {
val conf = new SparkConf()
  .setAppName("test")
  .set("spark.cassandra.connection.host", "xxx")
  .set("spark.cassandra.connection.keep_alive_ms", "3")
  .setMaster("local[*]")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.sparkContext.setLogLevel("INFO")

// Initialise Kafka
val kafkaTopics = Set[String]("xxx")
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
  "auto.offset.reset" -> "smallest")

// Kafka stream
val messages = KafkaUtils.createDirectStream[String, MyMsg,  
StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)


// Executed on the driver
messages.foreachRDD { rdd =>

  // Create an instance of SQLContext
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._

  // Map MyMsg RDD
  val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}

  // Convert RDD[MyMsg] to DataFrame
  val MyMsgDf = MyMsgRdd.toDF()
.select(
$"prim1Id" as 'prim1_id,
$"prim2Id" as 'prim2_id,
$...
  )

  // Load DataFrame from C* data-source
  val base_data = base_data_df.getInstance(sqlContext)

  // Inner join on prim1Id and prim2Id
  val joinedDf = MyMsgDf.join(base_data,
MyMsgDf("prim1_id") === base_data("prim1_id") &&
MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
.filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
&& base_data("prim2_id").isin(MyMsgDf("prim2_id")))

  joinedDf.show()
  joinedDf.printSchema()

  // Select relevant fields

  // Persist

}

// Start the computation
ssc.start()
ssc.awaitTermination()
}

SO:  
http://stackoverflow.com/questions/35295182/joining-kafka-and-cassandra-dataframes-in-spark-streaming-ignores-c-predicate-p
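
One connector-level alternative worth noting here (a sketch, untested against this 
schema): the connector's joinWithCassandraTable does a per-key lookup in Cassandra 
instead of a full table scan. The keyspace/table names are placeholders, and it is 
assumed that MyMsg exposes prim1Id/prim2Id fields (as the DataFrame selection above 
suggests) and that those two columns form the table's partition key.

// spark-cassandra-connector 1.5 RDD API
import com.datastax.spark.connector._

val joinedRdd = MyMsgRdd
  .map(msg => (msg.prim1Id, msg.prim2Id))            // keys to look up
  .joinWithCassandraTable("keyspace", "base_table")  // fetches only the matching C* rows per key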




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
Hello All,

joinWith() method in Dataset takes a condition of type Column. Without
converting a Dataset to a DataFrame, how can we get a specific column?

For eg: case class Pair(x: Long, y: Long)

A, B are Datasets of type Pair and I want to join A.x with B.y

A.joinWith(B, A.toDF().col("x") == B.toDF().col("y"))

Is there a way to avoid using toDF()?

I am having similar issues with the usage of filter(A.x == B.y)

-- 
Regards,
Raghava


Re: Dataset joinWith condition

2016-02-09 Thread Ted Yu
Please take a look at:
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

val ds1 = Seq(1, 2, 3).toDS().as("a")
val ds2 = Seq(1, 2).toDS().as("b")

checkAnswer(
  ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"),

On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju  wrote:

> Hello All,
>
> joinWith() method in Dataset takes a condition of type Column. Without
> converting a Dataset to a DataFrame, how can we get a specific column?
>
> For eg: case class Pair(x: Long, y: Long)
>
> A, B are Datasets of type Pair and I want to join A.x with B.y
>
> A.joinWith(B, A.toDF().col("x") == B.toDF().col("y"))
>
> Is there a way to avoid using toDF()?
>
> I am having similar issues with the usage of filter(A.x == B.y)
>
> --
> Regards,
> Raghava
>
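
Applied to the Pair example from the original question, that aliasing pattern looks 
roughly like this (a sketch, assuming a Spark 1.6 shell with sqlContext.implicits._ 
in scope; the sample data is made up):

case class Pair(x: Long, y: Long)

val A = Seq(Pair(1L, 2L), Pair(3L, 4L)).toDS().as("a")   // alias the Datasets
val B = Seq(Pair(2L, 5L), Pair(4L, 6L)).toDS().as("b")

// refer to the columns through the aliases instead of converting to DataFrames
val joined = A.joinWith(B, $"a.x" === $"b.y", "inner")   // Dataset[(Pair, Pair)]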


createDataFrame question

2016-02-09 Thread jdkorigan
Hi,

I would like to transform my rdd to a sql.dataframe.Dataframe, is there a
possible conversion to do the job? or what would be the easiest way to do
it?
 
def ConvertVal(iter):
# some code
return sqlContext.createDataFrame(Row("val1", "val2", "val3", "val4"))


rdd = sc.textFile("").mapPartitions(ConvertVal)

print(type(rdd)) #



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/createDataFrame-question-tp26178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: createDataFrame question

2016-02-09 Thread satish chandra j
Hi,
Hope you are aware of "toDF()", which is used to convert your RDD to a
DataFrame.

Regards,
Satish Chandra

On Tue, Feb 9, 2016 at 5:52 PM, jdkorigan  wrote:

> Hi,
>
> I would like to transform my rdd to a sql.dataframe.Dataframe, is there a
> possible conversion to do the job? or what would be the easiest way to do
> it?
>
> def ConvertVal(iter):
> # some code
> return sqlContext.createDataFrame(Row("val1", "val2", "val3", "val4"))
>
>
> rdd = sc.textFile("").mapPartitions(ConvertVal)
>
> print(type(rdd)) #
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/createDataFrame-question-tp26178.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
For sql shuffle operations like groupby, the number of output partitions is
controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not
honour this.

In my small test, I could see that the number of partitions  in DF returned
by orderBy was equal to the total number of distinct keys. Are you
observing the same, I mean do you have a single value for all rows in the
column on which you are running orderBy? If yes, you are better off not
running the orderBy clause.

Maybe someone from the Spark SQL team could answer how the partitioning of the
output DF should be handled when doing an orderBy?

Hemant
www.snappydata.io
https://github.com/SnappyDataInc/snappydata




On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores  wrote:

>
> I have a data frame which I sort using orderBy function. This operation
> causes my data frame to go to a single partition. After using those
> results, I would like to re-partition to a larger number of partitions.
> Currently I am just doing:
>
> val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single
> partition and around 14 million records
> val newDF =  hc.createDataFrame(rdd, df.schema)
>
> This process is really slow. Is there any other way of achieving this
> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>
>
> Thanks a lot
> --
> Cesar Flores
>


Re: Can't view executor logs in web UI on Windows

2016-02-09 Thread KhajaAsmath Mohammed
Hi,

I am new to Spark and trying to learn by writing some programs on Windows. I
faced the same issue when running on Windows: I cannot open the Spark WebUI. I
can see the output, and the output folder has the information that I needed, but
the logs state that the WebUI is stopped. Does anyone have a solution for viewing
the WebUI on Windows?

Thanks,
Asmath

On Tue, Feb 9, 2016 at 9:58 AM, Mark Pavey  wrote:

> I have submitted a pull request:
> https://github.com/apache/spark/pull/11135.
>
> Mark
>
>
> -Original Message-
> From: Mark Pavey [mailto:mark.pa...@thefilter.com]
> Sent: 05 February 2016 17:09
> To: 'Ted Yu'
> Cc: user@spark.apache.org
> Subject: RE: Can't view executor logs in web UI on Windows
>
> We have created JIRA ticket
> https://issues.apache.org/jira/browse/SPARK-13142 and will submit a pull
> request next week.
>
> Mark
>
>
> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: 01 February 2016 14:24
> To: Mark Pavey
> Cc: user@spark.apache.org
> Subject: Re: Can't view executor logs in web UI on Windows
>
> I did a brief search but didn't find relevant JIRA either.
>
> You can create a JIRA and submit pull request for the fix.
>
> Cheers
>
> > On Feb 1, 2016, at 5:13 AM, Mark Pavey  wrote:
> >
> > I am running Spark on Windows. When I try to view the Executor logs in
> > the UI I get the following error:
> >
> > HTTP ERROR 500
> >
> > Problem accessing /logPage/. Reason:
> >
> >Server Error
> > Caused by:
> >
> > java.net.URISyntaxException: Illegal character in path at index 1:
> > .\work/app-20160129154716-0038/2/
> >at java.net.URI$Parser.fail(Unknown Source)
> >at java.net.URI$Parser.checkChars(Unknown Source)
> >at java.net.URI$Parser.parseHierarchical(Unknown Source)
> >at java.net.URI$Parser.parse(Unknown Source)
> >at java.net.URI.(Unknown Source)
> >at org.apache.spark.deploy.worker.ui.LogPage.getLog(LogPage.scala:141)
> >at org.apache.spark.deploy.worker.ui.LogPage.render(LogPage.scala:78)
> >at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> >at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> >at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:69)
> >at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
> >at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
> >at
> >
>
> org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
> >at
> >
>
> org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:
> 501)
> >at
> >
>
> org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandle
> r.java:1086)
> >at
> >
>
> org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:4
> 28)
> >at
> >
>
> org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler
> .java:1020)
> >at
> >
>
> org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
> va:135)
> >at
> >
>
> org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:2
> 64)
> >at
> >
>
> org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(Conte
> xtHandlerCollection.java:255)
> >at
> >
>
> org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.
> java:116)
> >at org.spark-project.jetty.server.Server.handle(Server.java:370)
> >at
> >
>
> org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(Abstract
> HttpConnection.java:494)
> >at
> >
>
> org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(Abstrac
> tHttpConnection.java:971)
> >at
> >
>
> org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerC
> omplete(AbstractHttpConnection.java:1033)
> >at
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
> >at
> >
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> >at
> >
>
> org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnectio
> n.java:82)
> >at
> >
>
> org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEnd
> Point.java:667)
> >at
> >
>
> org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndP
> oint.java:52)
> >at
> >
>
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool
> .java:608)
> >at
> >
>
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.
> java:543)
> >at java.lang.Thread.run(Unknown Source)
> >
> >
> >
> > Looking at the source code for
> > org.apache.spark.deploy.worker.ui.LogPage.getLog reveals the following:
> > - At line 141 the constructor of java.net.URI is called with the path
> > to the log directory as a String argument. This string
> > (".\work/app-20160129154716-0038/2/" in example above) contains a
> > backslash, which is an illegal character for the URI constructor.
> > - The component 

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
Hi,

DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
`HashPartitioning`.
`RangePartitioning` roughly samples input data and internally computes
partition bounds
to split given rows into `spark.sql.shuffle.partitions` partitions.
Therefore, when sort keys are highly skewed, I think some partitions could
end up being empty
(that is, # of result partitions is lower than `spark.sql.shuffle.partitions`).
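
A small illustration of the two settings discussed here (a sketch; a Spark 1.6 
shell is assumed, and `df` and the column name "key" are placeholders):

// upper bound on partitions produced by the range-partitioned sort
sqlContext.setConf("spark.sql.shuffle.partitions", "100")
val sorted = df.orderBy("key")           // Exchange with RangePartitioning, at most 100 partitions
println(sorted.rdd.partitions.length)    // can be fewer when the sampled key range is skewed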


On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat 
wrote:

> For sql shuffle operations like groupby, the number of output partitions
> is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does
> not honour this.
>
> In my small test, I could see that the number of partitions  in DF
> returned by orderBy was equal to the total number of distinct keys. Are you
> observing the same, I mean do you have a single value for all rows in the
> column on which you are running orderBy? If yes, you are better off not
> running the orderBy clause.
>
> May be someone from spark sql team could answer that how should the
> partitioning of the output DF be handled when doing an orderBy?
>
> Hemant
> www.snappydata.io
> https://github.com/SnappyDataInc/snappydata
>
>
>
>
> On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores  wrote:
>
>>
>> I have a data frame which I sort using orderBy function. This operation
>> causes my data frame to go to a single partition. After using those
>> results, I would like to re-partition to a larger number of partitions.
>> Currently I am just doing:
>>
>> val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single
>> partition and around 14 million records
>> val newDF =  hc.createDataFrame(rdd, df.schema)
>>
>> This process is really slow. Is there any other way of achieving this
>> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>>
>>
>> Thanks a lot
>> --
>> Cesar Flores
>>
>
>


-- 
---
Takeshi Yamamuro


Re: createDataFrame question

2016-02-09 Thread jdkorigan
When using this function: rdd =
sc.textFile("").mapPartitions(ConvertVal).toDF()

I get an exception and the last line is:
TypeError: 'JavaPackage' object is not callable

Since my function return value is already DataFrame, maybe there is a way to
access this type from my rdd?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/createDataFrame-question-tp26178p26180.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Can't view executor logs in web UI on Windows

2016-02-09 Thread Mark Pavey
I have submitted a pull request: https://github.com/apache/spark/pull/11135.

Mark


-Original Message-
From: Mark Pavey [mailto:mark.pa...@thefilter.com] 
Sent: 05 February 2016 17:09
To: 'Ted Yu'
Cc: user@spark.apache.org
Subject: RE: Can't view executor logs in web UI on Windows

We have created JIRA ticket
https://issues.apache.org/jira/browse/SPARK-13142 and will submit a pull
request next week.

Mark


-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: 01 February 2016 14:24
To: Mark Pavey
Cc: user@spark.apache.org
Subject: Re: Can't view executor logs in web UI on Windows

I did a brief search but didn't find relevant JIRA either. 

You can create a JIRA and submit pull request for the fix. 

Cheers

> On Feb 1, 2016, at 5:13 AM, Mark Pavey  wrote:
> 
> I am running Spark on Windows. When I try to view the Executor logs in 
> the UI I get the following error:
> 
> HTTP ERROR 500
> 
> Problem accessing /logPage/. Reason:
> 
>Server Error
> Caused by:
> 
> java.net.URISyntaxException: Illegal character in path at index 1:
> .\work/app-20160129154716-0038/2/
>at java.net.URI$Parser.fail(Unknown Source)
>at java.net.URI$Parser.checkChars(Unknown Source)
>at java.net.URI$Parser.parseHierarchical(Unknown Source)
>at java.net.URI$Parser.parse(Unknown Source)
>at java.net.URI.(Unknown Source)
>at org.apache.spark.deploy.worker.ui.LogPage.getLog(LogPage.scala:141)
>at org.apache.spark.deploy.worker.ui.LogPage.render(LogPage.scala:78)
>at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
>at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:69)
>at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>at
>
org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>at
>
org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:
501)
>at
>
org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandle
r.java:1086)
>at
>
org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:4
28)
>at
>
org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler
.java:1020)
>at
>
org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
va:135)
>at
>
org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:2
64)
>at
>
org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(Conte
xtHandlerCollection.java:255)
>at
>
org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.
java:116)
>at org.spark-project.jetty.server.Server.handle(Server.java:370)
>at
>
org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(Abstract
HttpConnection.java:494)
>at
>
org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(Abstrac
tHttpConnection.java:971)
>at
>
org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerC
omplete(AbstractHttpConnection.java:1033)
>at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>at
>
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>at
>
org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnectio
n.java:82)
>at
>
org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEnd
Point.java:667)
>at
>
org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndP
oint.java:52)
>at
>
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool
.java:608)
>at
>
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.
java:543)
>at java.lang.Thread.run(Unknown Source)
> 
> 
> 
> Looking at the source code for
> org.apache.spark.deploy.worker.ui.LogPage.getLog reveals the following:
> - At line 141 the constructor of java.net.URI is called with the path 
> to the log directory as a String argument. This string 
> (".\work/app-20160129154716-0038/2/" in example above) contains a 
> backslash, which is an illegal character for the URI constructor.
> - The component of the path containing the backslash is created at 
> line 71 by calling the getPath method on a java.io.File object.
> Because it is running on Windows it uses the default Windows file 
> separator, which is a backslash.
> 
> I am using Spark 1.5.1 but the source code appears unchanged in 1.6.0.
> 
> I haven't been able to find an open issue for this but if there is one 
> could possibly submit a pull request for it.
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-view-executo
> r-logs-in-web-UI-on-Windows-tp26122.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 
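
The root cause Mark describes above (a backslash-containing path from File.getPath 
handed straight to the java.net.URI constructor) can be reproduced and worked 
around in isolation roughly as follows; this is only an illustrative sketch, not 
the actual SPARK-13142 patch:

import java.io.File
import java.net.URI

val logDir = ".\\work/app-20160129154716-0038/2/"  // the path from the error above

// new URI(logDir) throws URISyntaxException because of the backslash;
// letting File build the URI normalizes and encodes the path instead
val safe: URI = new File(logDir).toURI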

jssc.textFileStream(directory) how to ensure it read entire all incoming files

2016-02-09 Thread unk1102
Hi, my actual use case is streaming text files from an HDFS directory and sending
them to Kafka; please let me know if there is any existing solution for this.
Anyway, I have the following code:

// let's assume the directory contains one file, a.txt, with 100 lines
JavaDStream<String> logData = jssc.textFileStream(directory);

// how do I make sure jssc.textFileStream() has read all 100 lines, so I can
// delete the file later?

Also, please let me know how I can send the above logData to Kafka using Spark
Streaming.
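
There is no built-in "DStream to Kafka" sink in Spark 1.x, but writing from the 
executors is commonly done along these lines (a rough sketch, shown with the Scala 
DStream API for brevity; the broker list, topic name and a kafka-clients 0.8.2+ 
dependency are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

logData.foreachRDD { rdd =>
  rdd.foreachPartition { lines =>
    // one producer per partition, created on the executor
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    lines.foreach(line => producer.send(new ProducerRecord[String, String]("mytopic", line)))
    producer.close()
  }
}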



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/jssc-textFileStream-directory-how-to-ensure-it-read-entire-all-incoming-files-tp26181.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: createDataFrame question

2016-02-09 Thread jdkorigan
The correct way is just to remove "sqlContext.createDataFrame", and
everything works correctly:

def ConvertVal(iter): 
# some code 
return Row("val1", "val2", "val3", "val4") 

rdd = sc.textFile("").mapPartitions(ConvertVal).toDF()



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/createDataFrame-question-tp26178p26182.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Rachana Srivastava
I am trying to run an application in yarn cluster mode.

Spark-Submit with Yarn Cluster
Here are setting of the shell script:
spark-submit --class "com.Myclass"  \
--num-executors 2 \
--executor-cores 2 \
--master yarn \
--supervise \
--deploy-mode cluster \
../target/ \

My application is working fine in yarn-client and local mode.

Excerpt of the error when we submit the application via spark-submit in yarn-cluster 
mode:

&& HADOOP HOME correct path logged but still getting the 
error
/usr/lib/hadoop
&& HADOOP_CONF_DIR
/usr/lib/hadoop/etc/hadoop
...
Diagnostics: Exception from container-launch.
Container id: container_1454984479786_0006_02_01
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)

Further, I am getting the following error:
ERROR DETAILS FROM YARN LOGS APPLICATIONID
INFO : org.apache.spark.deploy.yarn.ApplicationMaster - Registered signal 
handlers for [TERM, HUP, INT]
DEBUG: org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home 
directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:307)
at org.apache.hadoop.util.Shell.(Shell.java:332)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
at 
org.apache.hadoop.yarn.conf.YarnConfiguration.(YarnConfiguration.java:590)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:52)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:386)
at 
org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:384)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:384)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:401)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:623)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)

I tried modifying spark-env.sh as follows, and I see HADOOP_HOME logged, but I am 
still getting the above error:
Modified added following entries to spark-env.sh
export HADOOP_HOME="/usr/lib/hadoop"
echo "&& HADOOP HOME "
echo "$HADOOP_HOME"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
echo "&& HADOOP_CONF_DIR "
echo "$HADOOP_CONF_DIR"


Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Cesar Flores
Well, actually I am observing a single partition no matter what my input
is. I am using spark 1.3.1.

For what you both are saying, it appears that this sorting issue (going to
a single partition after applying orderBy in a DF) is solved in later
version of Spark? Well, if that is the case, I guess I just need to wait
until my workplace decides to update.


Thanks a lot
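
For reference, the same reshuffle can be written without dropping to the RDD API 
(a sketch; DataFrame.repartition(numPartitions) exists from Spark 1.3.0 onwards, 
and, like the coalesce-with-shuffle workaround, it does not preserve the sort order):

// df is the sorted, single-partition DataFrame from the question
val newDF = df.repartition(100)  // hash-repartitions into 100 partitions in one step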

On Tue, Feb 9, 2016 at 9:39 AM, Takeshi Yamamuro 
wrote:

> Hi,
>
> DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
> `HashPartitioning`.
> `RangePartitioning` roughly samples input data and internally computes
> partition bounds
> to split given rows into `spark.sql.shuffle.partitions` partitions.
> Therefore, when sort keys are highly skewed, I think some partitions could
> end up being empty
> (that is, # of result partitions is lower than `spark.sql.shuffle.partitions`
> .
>
>
> On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat 
> wrote:
>
>> For sql shuffle operations like groupby, the number of output partitions
>> is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does
>> not honour this.
>>
>> In my small test, I could see that the number of partitions  in DF
>> returned by orderBy was equal to the total number of distinct keys. Are you
>> observing the same, I mean do you have a single value for all rows in the
>> column on which you are running orderBy? If yes, you are better off not
>> running the orderBy clause.
>>
>> May be someone from spark sql team could answer that how should the
>> partitioning of the output DF be handled when doing an orderBy?
>>
>> Hemant
>> www.snappydata.io
>> https://github.com/SnappyDataInc/snappydata
>>
>>
>>
>>
>> On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores  wrote:
>>
>>>
>>> I have a data frame which I sort using orderBy function. This operation
>>> causes my data frame to go to a single partition. After using those
>>> results, I would like to re-partition to a larger number of partitions.
>>> Currently I am just doing:
>>>
>>> val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single
>>> partition and around 14 million records
>>> val newDF =  hc.createDataFrame(rdd, df.schema)
>>>
>>> This process is really slow. Is there any other way of achieving this
>>> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>>>
>>>
>>> Thanks a lot
>>> --
>>> Cesar Flores
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Cesar Flores


Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
I'm using Spark 1.6.0 and Hive 1.2.1, and there is just one property in the
hive-site.xml: hive.metastore.uris. Works for me. Can you check in the logs
that, when the HiveContext is created, it connects to the correct URI and
doesn't use Derby?

Cheers, Alex.
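
A quick way to do that check from the shell (a sketch; assumes a Spark 1.6 
spark-shell built with Hive support and hive-site.xml in the driver's conf directory):

// in such a shell, sqlContext is already a HiveContext
sqlContext.sql("SET hive.metastore.uris").show(false)  // should print the thrift:// URI from hive-site.xml
sqlContext.sql("SHOW DATABASES").show()                // should list the remote metastore's databases, not just 'default'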

On Tue, Feb 9, 2016 at 9:39 PM, Koert Kuipers  wrote:

> hey thanks. hive-site is on classpath in conf directory
>
> i currently got it to work by changing this hive setting in hive-site.xml:
> hive.metastore.schema.verification=true
> to
> hive.metastore.schema.verification=false
>
> this feels like a hack, because schema verification is a good thing i
> would assume?
>
> On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev 
> wrote:
>
>> Hi Koert,
>>
>> As far as I can see you are using derby:
>>
>>  Using direct SQL, underlying DB is DERBY
>>
>> not mysql, which is used for the metastore. That means, spark couldn't
>> find hive-site.xml on your classpath. Can you check that, please?
>>
>> Thanks, Alex.
>>
>> On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  wrote:
>>
>>> has anyone successfully connected to hive metastore using spark 1.6.0? i
>>> am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
>>> launching spark with yarn.
>>>
>>> this is what i see in logs:
>>> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore
>>> with URI thrift://metastore.mycompany.com:9083
>>> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>>>
>>> and then a little later:
>>>
>>> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
>>> version 1.2.1
>>> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
>>> 2.6.0-cdh5.4.4
>>> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
>>> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
>>> hive.server2.enable.impersonation does not exist
>>> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store
>>> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>>> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
>>> called
>>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>>> datanucleus.cache.level2 unknown - will be ignored
>>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
>>> hive.server2.enable.impersonation does not exist
>>> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object
>>> pin classes with
>>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
>>> underlying DB is DERBY
>>> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
>>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>>> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>>   at
>>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>>>   at
>>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
>>>   at
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
>>>   at
>>> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
>>>   at
>>> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
>>>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
>>>   at
>>> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>>>   at
>>> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>>>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>>>   at 

Re: Spark with .NET

2016-02-09 Thread Ted Yu
Looks like they have some system support whose source is not in the repo:


FYI

On Tue, Feb 9, 2016 at 12:17 PM, Bryan Jeffrey 
wrote:

> Arko,
>
> Check this out: https://github.com/Microsoft/SparkCLR
>
> This is a Microsoft authored C# language binding for Spark.
>
> Regards,
>
> Bryan Jeffrey
>
> On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee <
> arkoprovomukher...@gmail.com> wrote:
>
>> Doesn't seem to be supported, but thanks! I will probably write some .NET
>> wrapper in my front end and use the java api in the backend.
>> Warm regards
>> Arko
>>
>>
>> On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu  wrote:
>>
>>> This thread is related:
>>> http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+
>>>
>>> On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee <
>>> arkoprovomukher...@gmail.com> wrote:
>>>
 Hello,

 I want to use Spark (preferable Spark SQL) using C#. Anyone has any
 pointers to that?

 Thanks & regards
 Arko


>>>
>>
>


Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Benjamin Kim
I got the same problem when I added the Phoenix plugin jar in the driver and 
executor extra classpaths. Do you have those set too?

> On Feb 9, 2016, at 1:12 PM, Koert Kuipers  wrote:
> 
> yes its not using derby i think: i can see the tables in my actual hive 
> metastore.
> 
> i was using a symlink to /etc/hive/conf/hive-site.xml for my hive-site.xml 
> which has a lot more stuff than just hive.metastore.uris
> 
> let me try your approach
> 
> 
> 
> On Tue, Feb 9, 2016 at 3:57 PM, Alexandr Dzhagriev  > wrote:
> I'm using spark 1.6.0, hive 1.2.1 and there is just one property in the 
> hive-site.xml hive.metastore.uris Works for me. Can you check in the logs, 
> that when the HiveContext is created it connects to the correct uri and 
> doesn't use derby.
> 
> Cheers, Alex.
> 
> On Tue, Feb 9, 2016 at 9:39 PM, Koert Kuipers  > wrote:
> hey thanks. hive-site is on classpath in conf directory
> 
> i currently got it to work by changing this hive setting in hive-site.xml:
> hive.metastore.schema.verification=true
> to
> hive.metastore.schema.verification=false
> 
> this feels like a hack, because schema verification is a good thing i would 
> assume?
> 
> On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev  > wrote:
> Hi Koert,
> 
> As far as I can see you are using derby:
> 
>  Using direct SQL, underlying DB is DERBY
> 
> not mysql, which is used for the metastore. That means, spark couldn't find 
> hive-site.xml on your classpath. Can you check that, please?
> 
> Thanks, Alex.
> 
> On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  > wrote:
> has anyone successfully connected to hive metastore using spark 1.6.0? i am 
> having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and 
> launching spark with yarn.
> 
> this is what i see in logs:
> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore with 
> URI thrift://metastore.mycompany.com:9083 
> 
> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
> 
> and then a little later:
> 
> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive, version 
> 1.2.1
> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version: 
> 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded 
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not 
> present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not 
> present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is DERBY
> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
>   at 
> 

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
yes its not using derby i think: i can see the tables in my actual hive
metastore.

i was using a symlink to /etc/hive/conf/hive-site.xml for my hive-site.xml
which has a lot more stuff than just hive.metastore.uris

let me try your approach



On Tue, Feb 9, 2016 at 3:57 PM, Alexandr Dzhagriev  wrote:

> I'm using spark 1.6.0, hive 1.2.1 and there is just one property in the
> hive-site.xml hive.metastore.uris Works for me. Can you check in the
> logs, that when the HiveContext is created it connects to the correct uri
> and doesn't use derby.
>
> Cheers, Alex.
>
> On Tue, Feb 9, 2016 at 9:39 PM, Koert Kuipers  wrote:
>
>> hey thanks. hive-site is on classpath in conf directory
>>
>> i currently got it to work by changing this hive setting in hive-site.xml:
>> hive.metastore.schema.verification=true
>> to
>> hive.metastore.schema.verification=false
>>
>> this feels like a hack, because schema verification is a good thing i
>> would assume?
>>
>> On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev 
>> wrote:
>>
>>> Hi Koert,
>>>
>>> As far as I can see you are using derby:
>>>
>>>  Using direct SQL, underlying DB is DERBY
>>>
>>> not mysql, which is used for the metastore. That means, spark couldn't
>>> find hive-site.xml on your classpath. Can you check that, please?
>>>
>>> Thanks, Alex.
>>>
>>> On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  wrote:
>>>
 has anyone successfully connected to hive metastore using spark 1.6.0?
 i am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5
 and launching spark with yarn.

 this is what i see in logs:
 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore
 with URI thrift://metastore.mycompany.com:9083
 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.

 and then a little later:

 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
 version 1.2.1
 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
 2.6.0-cdh5.4.4
 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
 org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 
 2.6.0-cdh5.4.4
 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
 hive.server2.enable.impersonation does not exist
 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store
 with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
 called
 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
 datanucleus.cache.level2 unknown - will be ignored
 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
 present in CLASSPATH (or one of dependencies)
 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
 present in CLASSPATH (or one of dependencies)
 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
 hive.server2.enable.impersonation does not exist
 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object
 pin classes with
 hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
 "embedded-only" so does not have its own datastore table.
 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
 "embedded-only" so does not have its own datastore table.
 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
 "embedded-only" so does not have its own datastore table.
 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
 "embedded-only" so does not have its own datastore table.
 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
 underlying DB is DERBY
 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
 java.lang.RuntimeException: java.lang.RuntimeException: Unable to
 instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
   at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
   at
 org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
   at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
   at
 org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
   at
 

Re: Spark with .NET

2016-02-09 Thread Silvio Fiorito
That’s just a .NET assembly (not related to Spark DataSets) but doesn’t look 
like they’re actually using it. It’s typically a default reference pulled in by 
the project templates.

The code though is available from Mono here: 
https://github.com/mono/mono/tree/master/mcs/class/System.Data.DataSetExtensions

From: Ted Yu >
Date: Tuesday, February 9, 2016 at 3:56 PM
To: Bryan Jeffrey >
Cc: Arko Provo Mukherjee 
>, user 
>
Subject: Re: Spark with .NET

Looks like they have some system support whose source is not in the repo:


FYI

On Tue, Feb 9, 2016 at 12:17 PM, Bryan Jeffrey 
> wrote:
Arko,

Check this out: https://github.com/Microsoft/SparkCLR

This is a Microsoft authored C# language binding for Spark.

Regards,

Bryan Jeffrey

On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee 
> wrote:
Doesn't seem to be supported, but thanks! I will probably write some .NET 
wrapper in my front end and use the java api in the backend.
Warm regards
Arko


On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu 
> wrote:
This thread is related:
http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+

On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee 
> wrote:
Hello,

I want to use Spark (preferable Spark SQL) using C#. Anyone has any pointers to 
that?

Thanks & regards
Arko







Learning Fails with 4 Number of Layes at ANN Training with SGDOptimizer

2016-02-09 Thread Hayri Volkan Agun
Hi Everyone,

When MultilayerPerceptronClassifier is set up with three or four layers
and the SGDOptimizer's parameters are selected as follows:

tol : 1e-5
numIter=1
layers : 82,100,30,29
stepSize=0.05
sigmoidFunction in all layers

learning finishes but it doesn't converge. What may be the cause of this,
and what should the parameters be?

-- 
Hayri Volkan Agun
PhD. Student - Anadolu University


RE: spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Mohammed Guller
Alex – I suggest posting this question on the Spark Cassandra Connector mailing 
list. The SCC developers are pretty responsive.

Mohammed
Author: Big Data Analytics with 
Spark

From: Alexandr Dzhagriev [mailto:dzh...@gmail.com]
Sent: Tuesday, February 9, 2016 6:52 AM
To: user
Subject: spark-cassandra-connector BulkOutputWriter

Hello all,

I looked through the cassandra spark integration 
(https://github.com/datastax/spark-cassandra-connector) and couldn't find any 
usages of the BulkOutputWriter (http://www.datastax.com/dev/blog/bulk-loading) 
- an awesome tool for creating local sstables, which could be later uploaded to 
a cassandra cluster.  Seems like (sorry if I'm wrong), it uses just bulk insert 
statements. So, my question is: does anybody know if there are any plans to 
support bulk loading?

Cheers, Alex.


RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
You may have better luck with this question on the Spark Cassandra Connector 
mailing list.



One quick question about this code from your email:

   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



What exactly is base_data_df and how are you creating it?

Mohammed
Author: Big Data Analytics with 
Spark



-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 6:58 AM
To: user@spark.apache.org
Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames



All,



I'm new to Spark and I'm having a hard time doing a simple join of two DFs



Intent:

-  I'm receiving data from Kafka via direct stream and would like to enrich the 
messages with data from Cassandra. The Kafka messages

(Protobufs) are decoded into DataFrames and then joined with a (supposedly 
pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size 
to raw C* data is [several streaming messages to millions of C* rows], BUT the 
join always yields exactly ONE result [1:1] per message. After the join the 
resulting DF is eventually stored to another C* table.



Problem:

- Even though I'm joining the two DFs on the full Cassandra primary key and 
pushing the corresponding filter to C*, it seems that Spark is loading the 
whole C* data-set into memory before actually joining (which I'd like to 
prevent by using the filter/predicate pushdown).

This leads to a lot of shuffling and tasks being spawned, hence the "simple" 
join takes forever...



Could anyone shed some light on this? In my perception this should be a 
prime-example for DFs and Spark Streaming.



Environment:

- Spark 1.6

- Cassandra 2.1.12

- Cassandra-Spark-Connector 1.5-RC1

- Kafka 0.8.2.2



Code:



def main(args: Array[String]) {

 val conf = new SparkConf()

   .setAppName("test")

   .set("spark.cassandra.connection.host", "xxx")

   .set("spark.cassandra.connection.keep_alive_ms", "3")

   .setMaster("local[*]")



 val ssc = new StreamingContext(conf, Seconds(10))

 ssc.sparkContext.setLogLevel("INFO")



 // Initialise Kafka

 val kafkaTopics = Set[String]("xxx")

 val kafkaParams = Map[String, String](

   "metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",

   "auto.offset.reset" -> "smallest")



 // Kafka stream

 val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder, 
MyMsgDecoder](ssc, kafkaParams, kafkaTopics)



 // Executed on the driver

 messages.foreachRDD { rdd =>



   // Create an instance of SQLContext

   val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)

   import sqlContext.implicits._



   // Map MyMsg RDD

   val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}



   // Convert RDD[MyMsg] to DataFrame

   val MyMsgDf = MyMsgRdd.toDF()

.select(

 $"prim1Id" as 'prim1_id,

 $"prim2Id" as 'prim2_id,

 $...

   )



   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



   // Inner join on prim1Id and prim2Id

   val joinedDf = MyMsgDf.join(base_data,

 MyMsgDf("prim1_id") === base_data("prim1_id") &&

 MyMsgDf("prim2_id") === base_data("prim2_id"), "left")

 .filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))

 && base_data("prim2_id").isin(MyMsgDf("prim2_id")))



   joinedDf.show()

   joinedDf.printSchema()



   // Select relevant fields



   // Persist



 }



 // Start the computation

 ssc.start()

 ssc.awaitTermination()

}



SO:

http://stackoverflow.com/questions/35295182/joining-kafka-and-cassandra-dataframes-in-spark-streaming-ignores-c-predicate-p
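
(Archive note: the SQLContextSingleton and base_data_df helpers used above are not
shown in the post. The former is the usual lazy-singleton pattern from the Spark
Streaming programming guide; the latter presumably wraps a Cassandra data-source
read in the same way. A sketch only -- the keyspace and table names are placeholders:)

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) instance = new SQLContext(sparkContext)
    instance
  }
}

// Hypothetical: a similarly lazy wrapper around the Cassandra data source.
object base_data_df {
  @transient private var instance: DataFrame = _
  def getInstance(sqlContext: SQLContext): DataFrame = {
    if (instance == null) {
      instance = sqlContext.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "my_keyspace", "table" -> "base_data"))
        .load()
    }
    instance
  }
}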







-

To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org For 
additional commands, e-mail: 
user-h...@spark.apache.org




Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Jagat Singh
Hi,

I am using by telling Spark about hive version we are using. This is done
by setting following properties

spark.sql.hive.version
spark.sql.hive.metastore.jars

Thanks
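
(Archive note: a sketch of what that looks like in practice, e.g. in
spark-defaults.conf; the paths are placeholders, and the configurable property
is usually spelled spark.sql.hive.metastore.version:)

spark.sql.hive.metastore.version   1.2.1
spark.sql.hive.metastore.jars      /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*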


On Wed, Feb 10, 2016 at 7:39 AM, Koert Kuipers  wrote:

> hey thanks. hive-site is on classpath in conf directory
>
> i currently got it to work by changing this hive setting in hive-site.xml:
> hive.metastore.schema.verification=true
> to
> hive.metastore.schema.verification=false
>
> this feels like a hack, because schema verification is a good thing i
> would assume?
>
> On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev 
> wrote:
>
>> Hi Koert,
>>
>> As far as I can see you are using derby:
>>
>>  Using direct SQL, underlying DB is DERBY
>>
>> not mysql, which is used for the metastore. That means, spark couldn't
>> find hive-site.xml on your classpath. Can you check that, please?
>>
>> Thanks, Alex.
>>
>> On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  wrote:
>>
>>> has anyone successfully connected to hive metastore using spark 1.6.0? i
>>> am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
>>> launching spark with yarn.
>>>
>>> this is what i see in logs:
>>> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore
>>> with URI thrift://metastore.mycompany.com:9083
>>> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>>>
>>> and then a little later:
>>>
>>> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
>>> version 1.2.1
>>> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
>>> 2.6.0-cdh5.4.4
>>> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
>>> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
>>> hive.server2.enable.impersonation does not exist
>>> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store
>>> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>>> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
>>> called
>>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>>> datanucleus.cache.level2 unknown - will be ignored
>>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
>>> hive.server2.enable.impersonation does not exist
>>> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object
>>> pin classes with
>>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
>>> underlying DB is DERBY
>>> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
>>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>>> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>>   at
>>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>>>   at
>>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:194)
>>>   at
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
>>>   at
>>> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
>>>   at
>>> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
>>>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
>>>   at
>>> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>>>   at
>>> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>>>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>>>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>>   at 

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
i do not have phoenix, but i wonder if its something related. will check my
classpaths

On Tue, Feb 9, 2016 at 5:00 PM, Benjamin Kim  wrote:

> I got the same problem when I added the Phoenix plugin jar in the driver
> and executor extra classpaths. Do you have those set too?
>
> On Feb 9, 2016, at 1:12 PM, Koert Kuipers  wrote:
>
> yes its not using derby i think: i can see the tables in my actual hive
> metastore.
>
> i was using a symlink to /etc/hive/conf/hive-site.xml for my hive-site.xml
> which has a lot more stuff than just hive.metastore.uris
>
> let me try your approach
>
>
>
> On Tue, Feb 9, 2016 at 3:57 PM, Alexandr Dzhagriev 
> wrote:
>
>> I'm using spark 1.6.0, hive 1.2.1 and there is just one property in the
>> hive-site.xml hive.metastore.uris Works for me. Can you check in the
>> logs, that when the HiveContext is created it connects to the correct uri
>> and doesn't use derby.
>>
>> Cheers, Alex.
>>
>> On Tue, Feb 9, 2016 at 9:39 PM, Koert Kuipers  wrote:
>>
>>> hey thanks. hive-site is on classpath in conf directory
>>>
>>> i currently got it to work by changing this hive setting in
>>> hive-site.xml:
>>> hive.metastore.schema.verification=true
>>> to
>>> hive.metastore.schema.verification=false
>>>
>>> this feels like a hack, because schema verification is a good thing i
>>> would assume?
>>>
>>> On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev 
>>> wrote:
>>>
 Hi Koert,

 As far as I can see you are using derby:

  Using direct SQL, underlying DB is DERBY

 not mysql, which is used for the metastore. That means, spark couldn't
 find hive-site.xml on your classpath. Can you check that, please?

 Thanks, Alex.

 On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers 
 wrote:

> has anyone successfully connected to hive metastore using spark 1.6.0?
> i am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5
> and launching spark with yarn.
>
> this is what i see in logs:
> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore
> with URI thrift://metastore.mycompany.com:9083
> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>
> and then a little later:
>
> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
> version 1.2.1
> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
> 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 
> 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store
> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
> called
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
> datanucleus.cache.level2 unknown - will be ignored
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but
> not present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but
> not present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object
> pin classes with
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
> underlying DB is DERBY
> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate 

How to do a look up by id from files in hdfs inside a transformation/action ina RDD

2016-02-09 Thread SRK
Hi,

How can I do a lookup by id on a set of records stored in HDFS from inside a
transformation/action of an RDD?

Thanks,
Swetha
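
(Archive note: one common approach, not from the thread -- build a map from the
HDFS data on the driver, broadcast it, and consult it inside the transformation.
The path, delimiter and RDD shape below are assumptions, and the lookup data must
fit in driver/executor memory:)

// Sketch: broadcast a driver-side map built from the HDFS files.
val lookup: Map[String, String] = sc.textFile("hdfs:///lookup/data")   // placeholder path
  .map { line =>
    val Array(id, value) = line.split(",", 2)   // assumes "id,value" lines
    (id, value)
  }
  .collect()
  .toMap
val lookupBc = sc.broadcast(lookup)

// Inside any transformation, use the broadcast copy instead of re-reading HDFS.
// someRdd: RDD[(String, String)] keyed by id is assumed here.
val enriched = someRdd.map { case (id, payload) => (id, payload, lookupBc.value.get(id)) }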



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-a-look-up-by-id-from-files-in-hdfs-inside-a-transformation-action-ina-RDD-tp26185.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to collect/take arbitrary number of records in the driver?

2016-02-09 Thread SRK
Hi ,

How can I get a fixed range of records from an RDD in the driver? Suppose I want
the records from 100 to 1000 and then want to save them to some external database.
I know that I can do it from the workers per partition, but I want to avoid that
for some reasons. The idea is to collect the data to the driver and save it,
albeit slowly.

I am looking for something like take(100, 1000)  or take (1000,2000)

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-collect-take-arbitrary-number-of-records-in-the-driver-tp26184.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Increase in Processing Time

2016-02-09 Thread Ted Yu
1.4.1 was released half a year ago.

I doubt whether there would be 1.4.x patch release any more.

Please consider upgrading.

On Tue, Feb 9, 2016 at 1:23 PM, Bryan  wrote:

> Ted,
>
>
>
> We are using an inverse reducer function, but we do have a filter function
> in place to cull the key space .
>
>
>
> One thing I am thinking is that this increase in processing time may be
> associated with the ttl expiration time (which is currently 6 hours). It
> may be coincidence however; all streams have zero data so the RDD cleanup
> should be limited in scope. In the storage tab I see 28 bytes retained in
> memory (all other persisted data is size 0).
>
>
>
> I will try changing the ttl way up and see if that changes this hockey
> stick to a later time.
>
>
>
> Do you have other suggestions?
>
>
>
>
>
> Sent from my Windows 10 phone
>
>
>
> *From: *Ted Yu 
> *Sent: *Tuesday, February 9, 2016 4:16 PM
> *To: *Bryan Jeffrey 
> *Cc: *user 
> *Subject: *Re: Spark Increase in Processing Time
>
>
>
> Have you seen this thread ?
>
>
> http://search-hadoop.com/m/q3RTtM6WWs1yUHch2=Re+Spark+streaming+Processing+time+keeps+increasing
>
>
>
> On Tue, Feb 9, 2016 at 12:49 PM, Bryan Jeffrey 
> wrote:
>
> All,
>
>
>
> I am running the following versions:
>
> - Spark 1.4.1
>
> - Scala 2.11
>
> - Kafka 0.8.2.1
>
> - Spark Streaming
>
>
>
> I am seeing my Spark Streaming job increase in processing time after it
> has run for some period.
>
>
>
> [image: Inline image 1]
>
>
>
> If you look at the image above you can see the 'hockey stick' growth.
> This job is processing no input data (all batches have zero events).
> However, after about 4-8 hours (in this case 6 hours) the processing time
> for each job increases by around 20-30% (enough to push over my batch size).
>
>
>
> The job does not have one stage that grows in time - instead all stages
> grow.  I have an example of a given stage below in which we have the same
> set of tasks, etc. that are simply taking longer to complete.  I've labeled
> them 'long' and 'short' respectively.
>
>
>
> Has anyone seen this behavior? Does anyone have ideas on how to correct?
>
> Regards,
>
>
>
> Bryan Jeffrey
>
>
>
> Long Stage:
>
> [image: Inline image 2]
>
>
>
> Short Stage:
>
> [image: Inline image 3]
>
>
>
>
>
>
>
>
>


RE: HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Diwakar Dhanuskodi
It should work. Which version of Spark are you using? Try setting it up in
the program using SparkConf.set().


Sent from Samsung Mobile.
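
(Archive note: one more thing worth trying, sketched here with placeholder paths
and jar name -- environment variables can also be forwarded explicitly to the YARN
application master and executors via configuration:)

spark-submit --class "com.Myclass" \
--master yarn --deploy-mode cluster \
--conf spark.yarn.appMasterEnv.HADOOP_HOME=/usr/lib/hadoop \
--conf spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/usr/lib/hadoop/etc/hadoop \
--conf spark.executorEnv.HADOOP_HOME=/usr/lib/hadoop \
../target/myapp.jar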

 Original message From: 
rachana.srivast...@thomsonreuters.com Date:10/02/2016  00:47  
(GMT+05:30) To: diwakar.dhanusk...@gmail.com, 
rachana.srivast...@markmonitor.com, user@spark.apache.org Cc:  
Subject: RE: HADOOP_HOME are not set when try to run spark 
application in yarn cluster mode 
Thanks so much Diwakar.
 
spark-submit --class "com.MyClass"  \
--files=/usr/lib/hadoop/etc/hadoop/core-site.xml,/usr/lib/hadoop/etc/hadoop/hdfs-site.xml,/usr/lib/hadoop/etc/hadoop/mapred-site.xml,/usr/lib/hadoop/etc/hadoop/ssl-client.xml,/usr/lib/hadoop/etc/hadoop/yarn-site.xml
 \
--num-executors 2 \
--master yarn-cluster \
 
I have added all the xml files in the spark-submit but still getting the same 
error.  I see all the Hadoop files logged.
 
16/02/09 11:07:00 INFO Client: Uploading resource 
file:/usr/lib/hadoop/etc/hadoop/core-site.xml -> 
hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/core-site.xml
16/02/09 11:07:00 INFO Client: Uploading resource 
file:/usr/lib/hadoop/etc/hadoop/hdfs-site.xml -> 
hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/hdfs-site.xml
16/02/09 11:07:00 INFO Client: Uploading resource 
file:/usr/lib/hadoop/etc/hadoop/mapred-site.xml -> 
hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/mapred-site.xml
16/02/09 11:07:00 INFO Client: Uploading resource 
file:/usr/lib/hadoop/etc/hadoop/ssl-client.xml -> 
hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/ssl-client.xml
16/02/09 11:07:00 INFO Client: Uploading resource 
file:/usr/lib/hadoop/etc/hadoop/yarn-site.xml -> 
hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1455041341343_0002/yarn-site.xml
 
From: Diwakar Dhanuskodi [mailto:diwakar.dhanusk...@gmail.com] 
Sent: Tuesday, February 09, 2016 10:00 AM
To: Rachana Srivastava; user@spark.apache.org
Subject: RE: HADOOP_HOME are not set when try to run spark application in yarn 
cluster mode
 
Pass all the Hadoop conf files as spark-submit parameters via --files.
 
 
Sent from Samsung Mobile.
 

 Original message 
From: Rachana Srivastava 
Date:09/02/2016 22:53 (GMT+05:30)
To: user@spark.apache.org
Cc:
Subject: HADOOP_HOME are not set when try to run spark application in yarn 
cluster mode
 
I am trying to run an application in yarn cluster mode.
 
Spark-Submit with Yarn Cluster
Here are setting of the shell script:
spark-submit --class "com.Myclass"  \
--num-executors 2 \
--executor-cores 2 \
--master yarn \
--supervise \
--deploy-mode cluster \
../target/ \
 
My application is working fine in yarn-client and local mode.
 
Excerpt of the error when we submit the application via spark-submit in
yarn-cluster mode.
 
&& HADOOP HOME correct path logged but still getting the 
error
/usr/lib/hadoop
&& HADOOP_CONF_DIR
/usr/lib/hadoop/etc/hadoop
...
Diagnostics: Exception from container-launch.
Container id: container_1454984479786_0006_02_01
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 
Further I am getting following error
ERROR DETAILS FROM YARN LOGS APPLICATIONID
INFO : org.apache.spark.deploy.yarn.ApplicationMaster - Registered signal 
handlers for [TERM, HUP, INT]
DEBUG: org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home 
directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:307)
at org.apache.hadoop.util.Shell.(Shell.java:332)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
at 
org.apache.hadoop.yarn.conf.YarnConfiguration.(YarnConfiguration.java:590)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:52)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

RE: How to collect/take arbitrary number of records in the driver?

2016-02-09 Thread Mohammed Guller
You can do something like this:



val indexedRDD = rdd.zipWithIndex

val filteredRDD = indexedRDD.filter { case (element, index) =>
  (index >= 99) && (index < 199)
}

val result = filteredRDD.take(100)



Warning: the ordering of the elements in the RDD is not guaranteed.

Mohammed
Author: Big Data Analytics with 
Spark



-Original Message-
From: SRK [mailto:swethakasire...@gmail.com]
Sent: Tuesday, February 9, 2016 1:58 PM
To: user@spark.apache.org
Subject: How to collect/take arbitrary number of records in the driver?



Hi ,



How to get a fixed amount of records from an RDD in Driver? Suppose I want the 
records from 100 to 1000 and then save them to some external database, I know 
that I can do it from Workers in partition but I want to avoid that for some 
reasons. The idea is to collect the data to driver and save, although slowly.



I am looking for something like take(100, 1000)  or take (1000,2000)



Thanks,

Swetha







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-collect-take-arbitrary-number-of-records-in-the-driver-tp26184.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org For 
additional commands, e-mail: 
user-h...@spark.apache.org




Re: Spark with .NET

2016-02-09 Thread Ted Yu
bq. it is a .NET assembly and not really used by SparkCLR

Then maybe drop the import ?

I was searching the SparkCLR repo to see whether (Spark) DataSet is
supported.

Cheers

On Tue, Feb 9, 2016 at 3:07 PM, skaarthik oss 
wrote:

> *Arko* – you could use the following links to get started with SparkCLR
> API and use C# with Spark for DataFrame processing. If you need the support
> for interactive scenario, please feel free to share your scenario and
> requirements to the SparkCLR project. Interactive scenario is one of the
> focus areas of the current milestone in SparkCLR project.
>
> ·
> https://github.com/Microsoft/SparkCLR/blob/master/examples/JdbcDataFrame/Program.cs
>
> ·
> https://github.com/Microsoft/SparkCLR/blob/master/csharp/Samples/Microsoft.Spark.CSharp/DataFrameSamples.cs
>
>
>
>
>
> *Ted* – System.Data.DataSetExtensions is a reference that is
> automatically added when a C# project is created in Visual Studio. As
> Silvio pointed out below, it is a .NET assembly and not really used by
> SparkCLR.
>
>
>
> *From:* Silvio Fiorito [mailto:silvio.fior...@granturing.com]
> *Sent:* Tuesday, February 9, 2016 1:31 PM
> *To:* Ted Yu ; Bryan Jeffrey  >
> *Cc:* Arko Provo Mukherjee ; user <
> user@spark.apache.org>
>
> *Subject:* Re: Spark with .NET
>
>
>
> That’s just a .NET assembly (not related to Spark DataSets) but doesn’t
> look like they’re actually using it. It’s typically a default reference
> pulled in by the project templates.
>
>
>
> The code though is available from Mono here:
> https://github.com/mono/mono/tree/master/mcs/class/System.Data.DataSetExtensions
>
>
>
> *From: *Ted Yu 
> *Date: *Tuesday, February 9, 2016 at 3:56 PM
> *To: *Bryan Jeffrey 
> *Cc: *Arko Provo Mukherjee , user <
> user@spark.apache.org>
> *Subject: *Re: Spark with .NET
>
>
>
> Looks like they have some system support whose source is not in the repo:
>
> 
>
>
>
> FYI
>
>
>
> On Tue, Feb 9, 2016 at 12:17 PM, Bryan Jeffrey 
> wrote:
>
> Arko,
>
>
> Check this out: https://github.com/Microsoft/SparkCLR
>
>
>
> This is a Microsoft authored C# language binding for Spark.
>
>
>
> Regards,
>
>
>
> Bryan Jeffrey
>
>
>
> On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee <
> arkoprovomukher...@gmail.com> wrote:
>
> Doesn't seem to be supported, but thanks! I will probably write some .NET
> wrapper in my front end and use the java api in the backend.
>
> Warm regards
>
> Arko
>
>
>
>
>
> On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu  wrote:
>
> This thread is related:
>
> http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+
>
>
>
> On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee <
> arkoprovomukher...@gmail.com> wrote:
>
> Hello,
>
>
>
> I want to use Spark (preferable Spark SQL) using C#. Anyone has any
> pointers to that?
>
> Thanks & regards
>
> Arko
>
>
>
>
>
>
>
>
>
>
>


RE: Spark with .NET

2016-02-09 Thread skaarthik oss
Arko – you could use the following links to get started with SparkCLR API and 
use C# with Spark for DataFrame processing. If you need the support for 
interactive scenario, please feel free to share your scenario and requirements 
to the SparkCLR project. Interactive scenario is one of the focus areas of the 
current milestone in SparkCLR project.

· 
https://github.com/Microsoft/SparkCLR/blob/master/examples/JdbcDataFrame/Program.cs

· 
https://github.com/Microsoft/SparkCLR/blob/master/csharp/Samples/Microsoft.Spark.CSharp/DataFrameSamples.cs

 

 

Ted – System.Data.DataSetExtensions is a reference that is automatically added 
when a C# project is created in Visual Studio. As Silvio pointed out below, it 
is a .NET assembly and not really used by SparkCLR.

 

From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] 
Sent: Tuesday, February 9, 2016 1:31 PM
To: Ted Yu ; Bryan Jeffrey 
Cc: Arko Provo Mukherjee ; user 

Subject: Re: Spark with .NET

 

That’s just a .NET assembly (not related to Spark DataSets) but doesn’t look 
like they’re actually using it. It’s typically a default reference pulled in by 
the project templates.

 

The code though is available from Mono here: 
https://github.com/mono/mono/tree/master/mcs/class/System.Data.DataSetExtensions

 

From: Ted Yu  >
Date: Tuesday, February 9, 2016 at 3:56 PM
To: Bryan Jeffrey  >
Cc: Arko Provo Mukherjee  >, user  >
Subject: Re: Spark with .NET

 

Looks like they have some system support whose source is not in the repo:



 

FYI

 

On Tue, Feb 9, 2016 at 12:17 PM, Bryan Jeffrey  > wrote:

Arko,


Check this out: https://github.com/Microsoft/SparkCLR

 

This is a Microsoft authored C# language binding for Spark.

 

Regards,

 

Bryan Jeffrey

 

On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee 
 > wrote:

Doesn't seem to be supported, but thanks! I will probably write some .NET 
wrapper in my front end and use the java api in the backend. 

Warm regards

Arko

 

 

On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu  > wrote:

This thread is related:

http://search-hadoop.com/m/q3RTtwp4nR1lugin1 
 
=+NET+on+Apache+Spark+

 

On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee 
 > wrote:

Hello, 

 

I want to use Spark (preferable Spark SQL) using C#. Anyone has any pointers to 
that? 

Thanks & regards

Arko

 

 

 

 

 



Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-09 Thread Shixiong(Ryan) Zhu
Could you do a thread dump in the executor that runs the Kinesis receiver
and post it? It would be great if you can provide the executor log as well?

On Tue, Feb 9, 2016 at 3:14 PM, Roberto Coluccio  wrote:

> Hello,
>
> can anybody kindly help me out a little bit here? I just verified the
> problem is still there on Spark 1.6.0 and emr-4.3.0 as well. It's
> definitely a Kinesis-related issue, since with Spark 1.6.0 I'm successfully
> able to get Streaming drivers to terminate with no issue IF I don't use
> Kinesis and open any Receivers.
>
> Thank you!
>
> Roberto
>
>
> On Tue, Feb 2, 2016 at 4:40 PM, Roberto Coluccio <
> roberto.coluc...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm struggling around an issue ever since I tried to upgrade my Spark
>> Streaming solution from 1.4.1 to 1.5+.
>>
>> I have a Spark Streaming app which creates 3 ReceiverInputDStreams
>> leveraging KinesisUtils.createStream API.
>>
>> I used to leverage a timeout to terminate my app
>> (StreamingContext.awaitTerminationOrTimeout(timeout)) gracefully (SparkConf
>> spark.streaming.stopGracefullyOnShutdown=true).
>>
>> I used to submit my Spark app on EMR in yarn-cluster mode.
>>
>> Everything worked fine up to Spark 1.4.1 (on EMR AMI 3.9).
>>
>> Since I upgraded (tried with Spark 1.5.2 on emr-4.2.0 and Spark 1.6.0 on
>> emr-4.3.0) I can't get the app to actually terminate. Logs tells me it
>> tries to, but no confirmation of receivers stop is retrieved. Instead, when
>> the timer gets to the next period, the StreamingContext continues its
>> processing for a while (then it gets killed with a SIGTERM 15. YARN's vmem
>> and pmem killls disabled).
>>
>> ...
>>
>> 16/02/02 21:22:08 INFO ApplicationMaster: Final app status: SUCCEEDED, 
>> exitCode: 0
>> 16/02/02 21:22:08 INFO StreamingContext: Invoking stop(stopGracefully=true) 
>> from shutdown hook
>> 16/02/02 21:22:08 INFO ReceiverTracker: Sent stop signal to all 3 receivers
>> 16/02/02 21:22:18 INFO ReceiverTracker: Waiting for receiver job to 
>> terminate gracefully
>> 16/02/02 21:22:52 INFO ContextCleaner: Cleaned shuffle 141
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
>> 172.31.3.140:50152 in memory (size: 23.9 KB, free: 2.1 GB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
>> ip-172-31-3-141.ec2.internal:41776 in memory (size: 23.9 KB, free: 1224.9 MB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
>> ip-172-31-3-140.ec2.internal:36295 in memory (size: 23.9 KB, free: 1224.0 MB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
>> ip-172-31-3-141.ec2.internal:56428 in memory (size: 23.9 KB, free: 1224.9 MB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
>> ip-172-31-3-140.ec2.internal:50542 in memory (size: 23.9 KB, free: 1224.7 MB)
>> 16/02/02 21:22:52 INFO ContextCleaner: Cleaned accumulator 184
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
>> 172.31.3.140:50152 in memory (size: 3.0 KB, free: 2.1 GB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
>> ip-172-31-3-141.ec2.internal:41776 in memory (size: 3.0 KB, free: 1224.9 MB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
>> ip-172-31-3-141.ec2.internal:56428 in memory (size: 3.0 KB, free: 1224.9 MB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
>> ip-172-31-3-140.ec2.internal:36295 in memory (size: 3.0 KB, free: 1224.0 MB)
>> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
>> ip-172-31-3-140.ec2.internal:50542 in memory (size: 3.0 KB, free: 1224.7 MB)
>> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 680 for time 145444830 
>> ms for checkpointing
>> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 708 for time 145444830 
>> ms for checkpointing
>> 16/02/02 21:25:00 INFO TransformedDStream: Slicing from 145444800 ms to 
>> 145444830 ms (aligned to 145444800 ms and 145444830 ms)
>> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 777 for time 145444830 
>> ms for checkpointing
>> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 801 for time 145444830 
>> ms for checkpointing
>> 16/02/02 21:25:00 INFO JobScheduler: Added jobs for time 145444830 ms
>> 16/02/02 21:25:00 INFO JobGenerator: Checkpointing graph for time 
>> 145444830 ms
>> 16/02/02 21:25:00 INFO DStreamGraph: Updating checkpoint data for time 
>> 145444830 ms
>> 16/02/02 21:25:00 INFO JobScheduler: Starting job streaming job 
>> 145444830 ms.0 from job set of time 145444830 ms
>>
>> ...
>>
>>
>> Please, this is really blocking in the upgrade process to latest Spark
>> versions and I really don't know how to work it around.
>>
>> Any help would be very much appreciated.
>>
>> Thank you,
>>
>> Roberto
>>
>>
>>
>


spark-csv partitionBy

2016-02-09 Thread Srikanth
Hello,

I want to save Spark job result as LZO compressed CSV files partitioned by
one or more columns.
Given that partitionBy is not supported by spark-csv, is there any
recommendation for achieving this in user code?

One quick option is to
  i) cache the result dataframe
  ii) get unique partition keys
  iii) Iterate over keys and filter the result for that key

   rawDF.cache
   val idList = rawDF.select($"ID").distinct.collect.toList.map(_.getLong(0))
   idList.foreach( id => {
     val rows = rawDF.filter($"ID" === id)
     rows.write.format("com.databricks.spark.csv").save(s"hdfs:///output/id=$id/")
   })

This approach doesn't scale well, especially since the number of unique IDs can
be between 500 and 700, and adding a second partition column will make this even
worse.

Wondering if anyone has an efficient work around?

Srikanth
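
(Archive note: a sketch of one alternative, not from the thread -- key the rows by
the partition column and let Hadoop's MultipleTextOutputFormat write one directory
per key. The column name "ID", the output path, the naive mkString CSV rendering
and the codec are assumptions; LZO output would additionally require the hadoop-lzo
codec classes on the classpath:)

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyedOutput extends MultipleTextOutputFormat[Any, Any] {
  // place each record under a Hive-style id=<key>/ subdirectory
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"id=$key/$name"
  // drop the key from the written line; only the CSV text is emitted
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

rawDF.rdd
  .map(row => (row.getLong(row.fieldIndex("ID")), row.mkString(",")))
  .saveAsHadoopFile("hdfs:///output", classOf[Long], classOf[String],
    classOf[KeyedOutput], classOf[GzipCodec])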


Re: Appropriate Apache Users List Uses

2016-02-09 Thread Pierce Lamb
I sent this mail. It was not automated or part of a mass email.

My apologies for misuse.

Pierce

On Tue, Feb 9, 2016 at 12:02 PM, u...@moosheimer.com 
wrote:

> I wouldn't expect this either.
> Very disappointing...
>
> -Kay-Uwe Moosheimer
>
> Am 09.02.2016 um 20:53 schrieb Ryan Victory :
>
> Yeah, a little disappointed with this, I wouldn't expect to be sent
> unsolicited mail based on my membership to this list.
>
> -Ryan Victory
>
> On Tue, Feb 9, 2016 at 1:36 PM, John Omernik  wrote:
>
>> All, I received this today, is this appropriate list use? Note: This was
>> unsolicited.
>>
>> Thanks
>> John
>>
>>
>>
>> From: Pierce Lamb 
>> 11:57 AM (1 hour ago)
>> to me
>>
>> Hi John,
>>
>> I saw you on the Spark Mailing List and noticed you worked for * and
>> wanted to reach out. My company, SnappyData, just launched an open source
>> OLTP + OLAP Database built on Spark. Our lead investor is Pivotal, whose
>> largest owner is EMC which makes * like a father figure :)
>>
>> SnappyData’s goal is two fold: Operationalize Spark and deliver truly
>> interactive queries. To do this, we first integrated Spark with an
>> in-memory database with a pedigree of production customer deployments:
>> GemFireXD (GemXD).
>>
>> GemXD operationalized Spark via:
>>
>> -- True high availability
>>
>> -- A highly concurrent environment
>>
>> -- An OLTP engine that can process transactions (mutable state)
>>
>> With GemXD as a storage engine, we packaged SnappyData with Approximate
>> Query Processing (AQP) technology. AQP enables interactive response times
>> even when data volumes are huge because it allows the developer to trade
>> latency for accuracy. AQP queries (SQL queries with a specified error rate)
>> execute on sample tables -- tables that have taken a stratified sample of
>> the full dataset. As such, AQP queries enable much faster decisions when
>> 100% accuracy isn’t needed and sample tables require far fewer resources to
>> manage.
>>
>> If that sounds interesting to you, please check out our Github repo (our
>> release is hosted there under “releases”):
>>
>> https://github.com/SnappyDataInc/snappydata
>>
>> We also have a technical paper that dives into the architecture:
>> http://www.snappydata.io/snappy-industrial
>>
>> Are you currently using Spark at ? I’d love to set up a call with you
>> and hear about how you’re using it and see if SnappyData could be a fit.
>>
>> In addition to replying to this email, there are many ways to chat with
>> us: https://github.com/SnappyDataInc/snappydata#community-support
>>
>> Hope to hear from you,
>>
>> Pierce
>>
>> pl...@snappydata.io
>>
>> http://www.twitter.com/snappydata
>>
>
>


Re: Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
Ted,

Thank you for the pointer. That works, but what does a string prefixed with
the $ sign mean? Is it an expression?

Could you also help me with the select() parameter syntax? I tried something
similar ($"a.x") and it gives an error message saying that a TypedColumn is
expected.

Regards,
Raghava.
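
(Archive note: a short sketch of the syntax in question, assuming the Pair case
class from the original question and an available sqlContext. $"..." is the
column-by-name shorthand, a ColumnName and hence a Column, brought in by the
implicits import:)

import sqlContext.implicits._

case class Pair(x: Long, y: Long)
val A = Seq(Pair(1L, 2L), Pair(3L, 4L)).toDS().as("a")
val B = Seq(Pair(2L, 5L), Pair(4L, 6L)).toDS().as("b")

// joinWith takes a plain Column condition; the aliases disambiguate the two sides.
val joined = A.joinWith(B, $"a.x" === $"b.y", "inner")

// Dataset.select expects a TypedColumn, hence the .as[...] conversion on the column.
val xs = A.select($"x".as[Long])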


On Tue, Feb 9, 2016 at 10:12 AM, Ted Yu  wrote:

> Please take a look at:
> sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
>
> val ds1 = Seq(1, 2, 3).toDS().as("a")
> val ds2 = Seq(1, 2).toDS().as("b")
>
> checkAnswer(
>   ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"),
>
> On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Hello All,
>>
>> joinWith() method in Dataset takes a condition of type Column. Without
>> converting a Dataset to a DataFrame, how can we get a specific column?
>>
>> For eg: case class Pair(x: Long, y: Long)
>>
>> A, B are Datasets of type Pair and I want to join A.x with B.y
>>
>> A.joinWith(B, A.toDF().col("x") == B.toDF().col("y"))
>>
>> Is there a way to avoid using toDF()?
>>
>> I am having similar issues with the usage of filter(A.x == B.y)
>>
>> --
>> Regards,
>> Raghava
>>
>
>


-- 
Regards,
Raghava
http://raghavam.github.io


Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-09 Thread Roberto Coluccio
Hello,

can anybody kindly help me out a little bit here? I just verified the
problem is still there on Spark 1.6.0 and emr-4.3.0 as well. It's
definitely a Kinesis-related issue, since with Spark 1.6.0 I'm successfully
able to get Streaming drivers to terminate with no issue IF I don't use
Kinesis and open any Receivers.

Thank you!

Roberto
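
(Archive note: the timeout-based shutdown being discussed boils down to roughly the
following sketch; the timeout variable and the explicit stop() call are assumptions,
since with spark.streaming.stopGracefullyOnShutdown=true the graceful stop can also
be left to the shutdown hook:)

ssc.start()
// Returns true if the context was already stopped before the timeout expired.
val terminated = ssc.awaitTerminationOrTimeout(timeoutMs)
if (!terminated) {
  // Ask for a graceful stop: receivers are stopped first, then queued batches drain.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}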


On Tue, Feb 2, 2016 at 4:40 PM, Roberto Coluccio  wrote:

> Hi,
>
> I'm struggling around an issue ever since I tried to upgrade my Spark
> Streaming solution from 1.4.1 to 1.5+.
>
> I have a Spark Streaming app which creates 3 ReceiverInputDStreams
> leveraging KinesisUtils.createStream API.
>
> I used to leverage a timeout to terminate my app
> (StreamingContext.awaitTerminationOrTimeout(timeout)) gracefully (SparkConf
> spark.streaming.stopGracefullyOnShutdown=true).
>
> I used to submit my Spark app on EMR in yarn-cluster mode.
>
> Everything worked fine up to Spark 1.4.1 (on EMR AMI 3.9).
>
> Since I upgraded (tried with Spark 1.5.2 on emr-4.2.0 and Spark 1.6.0 on
> emr-4.3.0) I can't get the app to actually terminate. Logs tells me it
> tries to, but no confirmation of receivers stop is retrieved. Instead, when
> the timer gets to the next period, the StreamingContext continues its
> processing for a while (then it gets killed with a SIGTERM 15. YARN's vmem
> and pmem killls disabled).
>
> ...
>
> 16/02/02 21:22:08 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 16/02/02 21:22:08 INFO StreamingContext: Invoking stop(stopGracefully=true) 
> from shutdown hook
> 16/02/02 21:22:08 INFO ReceiverTracker: Sent stop signal to all 3 receivers
> 16/02/02 21:22:18 INFO ReceiverTracker: Waiting for receiver job to terminate 
> gracefully
> 16/02/02 21:22:52 INFO ContextCleaner: Cleaned shuffle 141
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
> 172.31.3.140:50152 in memory (size: 23.9 KB, free: 2.1 GB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
> ip-172-31-3-141.ec2.internal:41776 in memory (size: 23.9 KB, free: 1224.9 MB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
> ip-172-31-3-140.ec2.internal:36295 in memory (size: 23.9 KB, free: 1224.0 MB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
> ip-172-31-3-141.ec2.internal:56428 in memory (size: 23.9 KB, free: 1224.9 MB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0 on 
> ip-172-31-3-140.ec2.internal:50542 in memory (size: 23.9 KB, free: 1224.7 MB)
> 16/02/02 21:22:52 INFO ContextCleaner: Cleaned accumulator 184
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
> 172.31.3.140:50152 in memory (size: 3.0 KB, free: 2.1 GB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
> ip-172-31-3-141.ec2.internal:41776 in memory (size: 3.0 KB, free: 1224.9 MB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
> ip-172-31-3-141.ec2.internal:56428 in memory (size: 3.0 KB, free: 1224.9 MB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
> ip-172-31-3-140.ec2.internal:36295 in memory (size: 3.0 KB, free: 1224.0 MB)
> 16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0 on 
> ip-172-31-3-140.ec2.internal:50542 in memory (size: 3.0 KB, free: 1224.7 MB)
> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 680 for time 145444830 
> ms for checkpointing
> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 708 for time 145444830 
> ms for checkpointing
> 16/02/02 21:25:00 INFO TransformedDStream: Slicing from 145444800 ms to 
> 145444830 ms (aligned to 145444800 ms and 145444830 ms)
> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 777 for time 145444830 
> ms for checkpointing
> 16/02/02 21:25:00 INFO StateDStream: Marking RDD 801 for time 145444830 
> ms for checkpointing
> 16/02/02 21:25:00 INFO JobScheduler: Added jobs for time 145444830 ms
> 16/02/02 21:25:00 INFO JobGenerator: Checkpointing graph for time 
> 145444830 ms
> 16/02/02 21:25:00 INFO DStreamGraph: Updating checkpoint data for time 
> 145444830 ms
> 16/02/02 21:25:00 INFO JobScheduler: Starting job streaming job 145444830 
> ms.0 from job set of time 145444830 ms
>
> ...
>
>
> Please help; this is really blocking the upgrade process to the latest Spark
> versions and I really don't know how to work around it.
>
> Any help would be very much appreciated.
>
> Thank you,
>
> Roberto
>
>
>
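
A minimal sketch of the shutdown pattern described above, assuming a
StreamingContext named ssc and a timeout in milliseconds (this is not a
confirmed fix for the Kinesis receiver issue): stop explicitly after the
timeout instead of relying only on the shutdown hook.

val stopped = ssc.awaitTerminationOrTimeout(timeoutMs)  // timeoutMs is a placeholder value
if (!stopped) {
  // stopSparkContext = true, stopGracefully = true
  ssc.stop(true, true)
}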


Re: Spark with .NET

2016-02-09 Thread Arko Provo Mukherjee
Hello,

Thanks so much for your help, very helpful! Let me explore some of the stuff
suggested :)

Thanks & regards
Arko


On Tue, Feb 9, 2016 at 3:17 PM, Ted Yu  wrote:

> bq. it is a .NET assembly and not really used by SparkCLR
>
> Then maybe drop the import ?
>
> I was searching the SparkCLR repo to see whether (Spark) DataSet is
> supported.
>
> Cheers
>
> On Tue, Feb 9, 2016 at 3:07 PM, skaarthik oss 
> wrote:
>
>> *Arko* – you could use the following links to get started with SparkCLR
>> API and use C# with Spark for DataFrame processing. If you need the support
>> for interactive scenario, please feel free to share your scenario and
>> requirements to the SparkCLR project. Interactive scenario is one of the
>> focus areas of the current milestone in SparkCLR project.
>>
>> ·
>> https://github.com/Microsoft/SparkCLR/blob/master/examples/JdbcDataFrame/Program.cs
>>
>> ·
>> https://github.com/Microsoft/SparkCLR/blob/master/csharp/Samples/Microsoft.Spark.CSharp/DataFrameSamples.cs
>>
>>
>>
>>
>>
>> *Ted* – System.Data.DataSetExtensions is a reference that is
>> automatically added when a C# project is created in Visual Studio. As
>> Silvio pointed out below, it is a .NET assembly and not really used by
>> SparkCLR.
>>
>>
>>
>> *From:* Silvio Fiorito [mailto:silvio.fior...@granturing.com]
>> *Sent:* Tuesday, February 9, 2016 1:31 PM
>> *To:* Ted Yu ; Bryan Jeffrey <
>> bryan.jeff...@gmail.com>
>> *Cc:* Arko Provo Mukherjee ; user <
>> user@spark.apache.org>
>>
>> *Subject:* Re: Spark with .NET
>>
>>
>>
>> That’s just a .NET assembly (not related to Spark DataSets) but doesn’t
>> look like they’re actually using it. It’s typically a default reference
>> pulled in by the project templates.
>>
>>
>>
>> The code though is available from Mono here:
>> https://github.com/mono/mono/tree/master/mcs/class/System.Data.DataSetExtensions
>>
>>
>>
>> *From: *Ted Yu 
>> *Date: *Tuesday, February 9, 2016 at 3:56 PM
>> *To: *Bryan Jeffrey 
>> *Cc: *Arko Provo Mukherjee , user <
>> user@spark.apache.org>
>> *Subject: *Re: Spark with .NET
>>
>>
>>
>> Looks like they have some system support whose source is not in the repo:
>>
>> 
>>
>>
>>
>> FYI
>>
>>
>>
>> On Tue, Feb 9, 2016 at 12:17 PM, Bryan Jeffrey 
>> wrote:
>>
>> Arko,
>>
>>
>> Check this out: https://github.com/Microsoft/SparkCLR
>>
>>
>>
>> This is a Microsoft authored C# language binding for Spark.
>>
>>
>>
>> Regards,
>>
>>
>>
>> Bryan Jeffrey
>>
>>
>>
>> On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee <
>> arkoprovomukher...@gmail.com> wrote:
>>
>> Doesn't seem to be supported, but thanks! I will probably write some .NET
>> wrapper in my front end and use the java api in the backend.
>>
>> Warm regards
>>
>> Arko
>>
>>
>>
>>
>>
>> On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu  wrote:
>>
>> This thread is related:
>>
>> http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+
>>
>>
>>
>> On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee <
>> arkoprovomukher...@gmail.com> wrote:
>>
>> Hello,
>>
>>
>>
>> I want to use Spark (preferable Spark SQL) using C#. Anyone has any
>> pointers to that?
>>
>> Thanks & regards
>>
>> Arko
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


How to use a register temp table inside mapPartitions of an RDD

2016-02-09 Thread SRK
hi,

How to use a registerTempTable to register an RDD as a temporary table and
use it inside mapPartitions of a different RDD?


Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-register-temp-table-inside-mapPartitions-of-an-RDD-tp26187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
The issue is still not solved, even in newer versions of Spark.


On Wed, Feb 10, 2016 at 1:36 AM, Cesar Flores  wrote:

> Well, actually I am observing a single partition no matter what my input
> is. I am using spark 1.3.1.
>
> From what you both are saying, it appears that this sorting issue (going to
> a single partition after applying orderBy on a DF) is solved in later
> versions of Spark? Well, if that is the case, I guess I just need to wait
> until my workplace decides to update.
>
>
> Thanks a lot
>
> On Tue, Feb 9, 2016 at 9:39 AM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
>> `HashPartitioning`.
>> `RangePartitioning` roughly samples input data and internally computes
>> partition bounds
>> to split given rows into `spark.sql.shuffle.partitions` partitions.
>> Therefore, when sort keys are highly skewed, I think some partitions
>> could end up being empty
>> (that is, the # of result partitions is lower than
>> `spark.sql.shuffle.partitions`).
>>
>>
>> On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat 
>> wrote:
>>
>>> For sql shuffle operations like groupby, the number of output partitions
>>> is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does
>>> not honour this.
>>>
>>> In my small test, I could see that the number of partitions in the DF
>>> returned by orderBy was equal to the total number of distinct keys. Are you
>>> observing the same? I mean, do you have a single value for all rows in the
>>> column on which you are running orderBy? If yes, you are better off not
>>> running the orderBy clause at all.
>>>
>>> Maybe someone from the Spark SQL team could answer how the partitioning of
>>> the output DF should be handled when doing an orderBy?
>>>
>>> Hemant
>>> www.snappydata.io
>>> https://github.com/SnappyDataInc/snappydata
>>>
>>>
>>>
>>>
>>> On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores  wrote:
>>>

 I have a data frame which I sort using orderBy function. This operation
 causes my data frame to go to a single partition. After using those
 results, I would like to re-partition to a larger number of partitions.
 Currently I am just doing:

 val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single
 partition and around 14 million records
 val newDF =  hc.createDataFrame(rdd, df.schema)

 This process is really slow. Is there any other way of achieving this
 task, or to optimize it (perhaps tweaking a spark configuration parameter)?


 Thanks a lot
 --
 Cesar Flores

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> Cesar Flores
>



-- 
---
Takeshi Yamamuro
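
A minimal sketch, assuming a sorted DataFrame named df, of an alternative that
was not raised in the thread: DataFrame.repartition also shuffles, so it is not
guaranteed to be faster, but it stays in the DataFrame API and avoids the extra
createDataFrame step.

// redistribute the single sorted partition across 100 partitions
val repartitioned = df.repartition(100)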


AM creation in yarn client mode

2016-02-09 Thread praveen S
Hi,

I have 2 questions when running the spark jobs on yarn in client mode :

1) Where is the AM(application master) created :

A) is it created on the client where the job was submitted? i.e driver and
AM on the same client?
Or
B) yarn decides where the the AM should be created?

2) Driver and AM run in different processes : is my assumption correct?

Regards,
Praveen


Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
Oh, OK. I was comparing groupBy with orderBy, and now I realize that they use
different partitioning schemes.

Thanks Takeshi.



On Tue, Feb 9, 2016 at 9:09 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of
> `HashPartitioning`.
> `RangePartitioning` roughly samples input data and internally computes
> partition bounds
> to split given rows into `spark.sql.shuffle.partitions` partitions.
> Therefore, when sort keys are highly skewed, I think some partitions could
> end up being empty
> (that is, the # of result partitions is lower than
> `spark.sql.shuffle.partitions`).
>
>
> On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat 
> wrote:
>
>> For sql shuffle operations like groupby, the number of output partitions
>> is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does
>> not honour this.
>>
>> In my small test, I could see that the number of partitions in the DF
>> returned by orderBy was equal to the total number of distinct keys. Are you
>> observing the same? I mean, do you have a single value for all rows in the
>> column on which you are running orderBy? If yes, you are better off not
>> running the orderBy clause at all.
>>
>> Maybe someone from the Spark SQL team could answer how the partitioning of
>> the output DF should be handled when doing an orderBy?
>>
>> Hemant
>> www.snappydata.io
>> https://github.com/SnappyDataInc/snappydata
>>
>>
>>
>>
>> On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores  wrote:
>>
>>>
>>> I have a data frame which I sort using orderBy function. This operation
>>> causes my data frame to go to a single partition. After using those
>>> results, I would like to re-partition to a larger number of partitions.
>>> Currently I am just doing:
>>>
>>> val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single
>>> partition and around 14 million records
>>> val newDF =  hc.createDataFrame(rdd, df.schema)
>>>
>>> This process is really slow. Is there any other way of achieving this
>>> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>>>
>>>
>>> Thanks a lot
>>> --
>>> Cesar Flores
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: How to use a register temp table inside mapPartitions of an RDD

2016-02-09 Thread Koert Kuipers
If you mean to both register and use the table while you are inside
mapPartitions, I do not think that is possible or advisable. Can you join
the data, or broadcast it?

On Tue, Feb 9, 2016 at 8:22 PM, SRK  wrote:

> hi,
>
> How to use a registerTempTable to register an RDD as a temporary table and
> use it inside mapPartitions of a different RDD?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-register-temp-table-inside-mapPartitions-of-an-RDD-tp26187.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
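
A minimal sketch of the broadcast alternative suggested above; smallRdd (an
RDD[(String, Int)]) and bigRdd (an RDD[String]) are hypothetical names, not
taken from the thread.

// collect the small "table" to the driver and broadcast it
val lookup = smallRdd.collectAsMap()
val bcLookup = sc.broadcast(lookup)

// look it up inside mapPartitions of the other RDD
val enriched = bigRdd.mapPartitions { iter =>
  val m = bcLookup.value  // read the broadcast value once per partition
  iter.map(key => (key, m.getOrElse(key, -1)))
}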


Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-09 Thread Li Ming Tsai
Hi,


It looks like KMeans++ is slow (SPARK-3424) in the initialisation phase, which
runs locally on the driver using only 1 core.


If I use random initialization, the job completes in 1.5 mins compared to 1hr+.


Should I move this to the dev list?


Regards,

Liming
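
A minimal sketch, assuming the MLlib KMeans builder API and an RDD[Vector]
named parsedData, of switching the initialization mode to random, which is
what made the difference above:

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(1000)
  .setMaxIterations(1)
  .setInitializationMode(KMeans.RANDOM)  // the default is KMeans.K_MEANS_PARALLEL ("k-means||")
  .run(parsedData)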



From: Li Ming Tsai 
Sent: Sunday, February 7, 2016 10:03 AM
To: user@spark.apache.org
Subject: Re: Slowness in Kmeans calculating fastSquaredDistance


Hi,


I did more investigation and found out that BLAS.scala calls the reference
implementation (f2jBLAS) for level 1 routines.


I even patched it to use nativeBLAS.ddot but it had no material impact.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala#L126


private def dot(x: DenseVector, y: DenseVector): Double = {
  val n = x.size
  f2jBLAS.ddot(n, x.values, 1, y.values, 1)
}


Maybe Xiangrui can comment on this?




From: Li Ming Tsai 
Sent: Friday, February 5, 2016 10:56 AM
To: user@spark.apache.org
Subject: Slowness in Kmeans calculating fastSquaredDistance


Hi,


I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl 
flag.


I am using spark local[4] mode and I run it like this:
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell ...

I have also added the following to /opt/intel/mkl/lib/intel64:
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so.3 -> libmkl_rt.so


I believe (???) that I'm using Intel MKL because the warnings went away:

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS

After collectAsMap, there is no progress but I can observe that only 1 CPU is 
being utilised with the following stack trace:

"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 
nid=0xbdc runnable [0x7fbf12205000]

   java.lang.Thread.State: RUNNABLE

at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)

at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)

at 
org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)


These last few steps take more than half of the total time for a 1Mx100 dataset.


The code is just:

val clusters = KMeans.train(parsedData, 1000, 1)


Shouldn't it be utilising all the cores for the dot product? Is this a
misconfiguration?


Thanks!




Re: AM creation in yarn client mode

2016-02-09 Thread ayan guha
It depends on yarn-cluster and yarn-client mode.

On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:

> Hi,
>
> I have 2 questions when running the spark jobs on yarn in client mode :
>
> 1) Where is the AM(application master) created :
>
> A) is it created on the client where the job was submitted? i.e driver and
> AM on the same client?
> Or
> B) yarn decides where the the AM should be created?
>
> 2) Driver and AM run in different processes : is my assumption correct?
>
> Regards,
> Praveen
>



-- 
Best Regards,
Ayan Guha


Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard

Hi Mohammed

I'm aware of that documentation; what are you hinting at specifically?
I'm pushing all elements of the partition key, so that should work. As
user zero323 on SO pointed out, the problem is most probably related
to the dynamic nature of the predicate elements (two distributed
collections per filter per join).


The statement "To push down partition keys, all of them must be  
included, but not more than one predicate per partition key, otherwise  
nothing is pushed down."


Does not apply IMO?

Bernhard

Quoting Mohammed Guller :


Hi Bernhard,

Take a look at the examples shown under the "Pushing down clauses to  
Cassandra" sections on this page:


https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md


Mohammed
Author: Big Data Analytics with Spark

-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 10:05 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

Thanks for the hint, I should probably do that :)

As for the DF singleton:

/**
  * Lazily instantiated singleton instance of base_data DataFrame
  */
object base_data_df {

   @transient private var instance: DataFrame = _

   def getInstance(sqlContext: SQLContext): DataFrame = {
 if (instance == null) {
   // Load DataFrame with C* data-source
   instance = sqlContext.read
 .format("org.apache.spark.sql.cassandra")
 .options(Map("table" -> "cf", "keyspace" -> "ks"))
 .load()
 }
 instance
   }
}

Bernhard

Quoting Mohammed Guller :


You may have better luck with this question on the Spark Cassandra
Connector mailing list.



One quick question about this code from your email:

   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



What exactly is base_data_df and how are you creating it?

Mohammed
Author: Big Data Analytics with
Spark



-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 6:58 AM
To: user@spark.apache.org
Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames



All,



I'm new to Spark and I'm having a hard time doing a simple join of two
DFs



Intent:

-  I'm receiving data from Kafka via direct stream and would like to
enrich the messages with data from Cassandra. The Kafka messages

(Protobufs) are decoded into DataFrames and then joined with a
(supposedly pre-filtered) DF from Cassandra. The relation of (Kafka)
streaming batch size to raw C* data is [several streaming messages to
millions of C* rows], BUT the join always yields exactly ONE result
[1:1] per message. After the join the resulting DF is eventually
stored to another C* table.



Problem:

- Even though I'm joining the two DFs on the full Cassandra primary
key and pushing the corresponding filter to C*, it seems that Spark is
loading the whole C* data-set into memory before actually joining
(which I'd like to prevent by using the filter/predicate pushdown).

This leads to a lot of shuffling and tasks being spawned, hence the
"simple" join takes forever...



Could anyone shed some light on this? In my perception this should be
a prime-example for DFs and Spark Streaming.



Environment:

- Spark 1.6

- Cassandra 2.1.12

- Cassandra-Spark-Connector 1.5-RC1

- Kafka 0.8.2.2



Code:



def main(args: Array[String]) {

 val conf = new SparkConf()

   .setAppName("test")

   .set("spark.cassandra.connection.host", "xxx")

   .set("spark.cassandra.connection.keep_alive_ms", "3")

   .setMaster("local[*]")



 val ssc = new StreamingContext(conf, Seconds(10))

 ssc.sparkContext.setLogLevel("INFO")



 // Initialise Kafka

 val kafkaTopics = Set[String]("xxx")

 val kafkaParams = Map[String, String](

   "metadata.broker.list" ->
"xxx:32000,xxx:32000,xxx:32000,xxx:32000",

   "auto.offset.reset" -> "smallest")



 // Kafka stream

 val messages = KafkaUtils.createDirectStream[String, MyMsg,
StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)



 // Executed on the driver

 messages.foreachRDD { rdd =>



   // Create an instance of SQLContext

   val sqlContext =
SQLContextSingleton.getInstance(rdd.sparkContext)

   import sqlContext.implicits._



   // Map MyMsg RDD

   val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}



   // Convert RDD[MyMsg] to DataFrame

   val MyMsgDf = MyMsgRdd.toDF()

.select(

 $"prim1Id" as 'prim1_id,

 $"prim2Id" as 'prim2_id,

 $...

   )



   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



   // Inner join on prim1Id and prim2Id

  

Pyspark - how to use UDFs with dataframe groupby

2016-02-09 Thread Viktor ARDELEAN
Hello,

I am using the following transformations on an RDD:

rddAgg = df.map(lambda l: (Row(a = l.a, b= l.b, c = l.c), l))\
   .aggregateByKey([], lambda accumulatorList, value:
accumulatorList + [value], lambda list1, list2: [list1] + [list2])

I want to use the dataframe groupBy + agg transformation instead of
map + aggregateByKey because as far as I know dataframe
transformations are faster than RDD transformations.

I just can't figure out how to use custom aggregate functions with agg.

*First step is clear:*

groupedData = df.groupBy("a","b","c")

*Second step is not very clear to me:*

dfAgg = groupedData.agg()

The agg documentation says the following:

agg(*exprs)

Compute aggregates and returns the result as a DataFrame.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is
the column to perform aggregation on, and the value is the aggregate
function.

Alternatively, exprs can also be a list of aggregate Column expressions.

Parameters: exprs – a dict mapping from column name (string) to aggregate
functions (string), or a list of Column.

Thanks for help!
-- 
Viktor

*P*   Don't print this email, unless it's really necessary. Take care of
the environment.
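
A minimal sketch (written in Scala syntax; the PySpark calls are analogous) of
the groupBy + agg form with the built-in aggregate functions, assuming a
DataFrame df with columns a, b, c and d:

import org.apache.spark.sql.functions._

val dfAgg = df.groupBy("a", "b", "c")
  .agg(avg("d").as("avg_d"), max("d").as("max_d"), count("d").as("cnt"))

// Collecting all rows of a group into a list (the closest analogue of the
// aggregateByKey pattern above) is not among the built-in avg/max/min/sum/count
// aggregates; with a HiveContext the collect_list aggregate can be used instead,
// e.g. df.groupBy("a", "b", "c").agg(expr("collect_list(d)")).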


Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
The filter in the join is re-arranged in the DAG (from what I can tell
from explain()/the UI) and should therefore be pushed down accordingly. I also
experimented with applying the filter to base_data explicitly before the
join, effectively creating a new DF, but no luck either.
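
One assumption-laden workaround sketch (not something confirmed in the thread):
turn the dynamic predicate into literal values, so the Cassandra source can push
it down, by collecting the (small) streaming-side keys on the driver first and
pre-filtering base_data with isin before the join. Filtering the two key columns
independently yields a superset of the wanted rows, which the join then narrows,
and the collect() runs once per micro-batch.

val keys = MyMsgDf.select($"prim1_id", $"prim2_id").collect()
val prim1Keys = keys.map(_.get(0))
val prim2Keys = keys.map(_.get(1))

// literal isin predicates can be pushed to the Cassandra source
val prefiltered = base_data
  .filter($"prim1_id".isin(prim1Keys: _*) && $"prim2_id".isin(prim2Keys: _*))

val joinedDf = MyMsgDf.join(prefiltered,
  MyMsgDf("prim1_id") === prefiltered("prim1_id") &&
  MyMsgDf("prim2_id") === prefiltered("prim2_id"))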



Quoting Mohammed Guller :

Moving the spark mailing list to BCC since this is not really  
related to Spark.


Maybe I am missing something, but where are you calling the filter
method on the base_data DF to push down the predicates to Cassandra  
before calling the join method?


Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 10:47 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

I'm aware of that documentation; what are you hinting at specifically?
I'm pushing all elements of the partition key, so that should work.
As user zero323 on SO pointed out, the problem is most probably
related to the dynamic nature of the predicate elements (two
distributed collections per filter per join).


The statement "To push down partition keys, all of them must be  
included, but not more than one predicate per partition key,  
otherwise nothing is pushed down."


Does not apply IMO?

Bernhard

Quoting Mohammed Guller :


Hi Bernhard,

Take a look at the examples shown under the "Pushing down clauses to
Cassandra" sections on this page:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/
14_data_frames.md


Mohammed
Author: Big Data Analytics with Spark

-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 10:05 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

Thanks for the hint, I should probably do that :)

As for the DF singleton:

/**
  * Lazily instantiated singleton instance of base_data DataFrame
  */
object base_data_df {

   @transient private var instance: DataFrame = _

   def getInstance(sqlContext: SQLContext): DataFrame = {
 if (instance == null) {
   // Load DataFrame with C* data-source
   instance = sqlContext.read
 .format("org.apache.spark.sql.cassandra")
 .options(Map("table" -> "cf", "keyspace" -> "ks"))
 .load()
 }
 instance
   }
}

Bernhard

Quoting Mohammed Guller :


You may have better luck with this question on the Spark Cassandra
Connector mailing list.



One quick question about this code from your email:

   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



What exactly is base_data_df and how are you creating it?

Mohammed
Author: Big Data Analytics with
Spark



-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 6:58 AM
To: user@spark.apache.org
Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames



All,



I'm new to Spark and I'm having a hard time doing a simple join of
two DFs



Intent:

-  I'm receiving data from Kafka via direct stream and would like to
enrich the messages with data from Cassandra. The Kafka messages

(Protobufs) are decoded into DataFrames and then joined with a
(supposedly pre-filtered) DF from Cassandra. The relation of (Kafka)
streaming batch size to raw C* data is [several streaming messages to
millions of C* rows], BUT the join always yields exactly ONE result
[1:1] per message. After the join the resulting DF is eventually
stored to another C* table.



Problem:

- Even though I'm joining the two DFs on the full Cassandra primary
key and pushing the corresponding filter to C*, it seems that Spark
is loading the whole C* data-set into memory before actually joining
(which I'd like to prevent by using the filter/predicate pushdown).

This leads to a lot of shuffling and tasks being spawned, hence the
"simple" join takes forever...



Could anyone shed some light on this? In my perception this should be
a prime-example for DFs and Spark Streaming.



Environment:

- Spark 1.6

- Cassandra 2.1.12

- Cassandra-Spark-Connector 1.5-RC1

- Kafka 0.8.2.2



Code:



def main(args: Array[String]) {

 val conf = new SparkConf()

   .setAppName("test")

   .set("spark.cassandra.connection.host", "xxx")

   .set("spark.cassandra.connection.keep_alive_ms", "3")

   .setMaster("local[*]")



 val ssc = new StreamingContext(conf, Seconds(10))

 ssc.sparkContext.setLogLevel("INFO")



 // Initialise Kafka

 val kafkaTopics = Set[String]("xxx")

 val kafkaParams = Map[String, String](

   "metadata.broker.list" ->
"xxx:32000,xxx:32000,xxx:32000,xxx:32000",

   

Turning on logging for internal Spark logs

2016-02-09 Thread Li Ming Tsai
Hi,


I have the default conf/log4j.properties:


log4j.rootCategory=INFO, console

log4j.appender.console=org.apache.log4j.ConsoleAppender

log4j.appender.console.target=System.err

log4j.appender.console.layout=org.apache.log4j.PatternLayout

log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: 
%m%n


# Settings to quiet third party logs that are too verbose

log4j.logger.org.spark-project.jetty=WARN

log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR

log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO

log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

log4j.logger.org.apache.parquet=ERROR

log4j.logger.parquet=ERROR


# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
UDFs in SparkSQL with Hive support

log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR


When running in local mode, logs from logInfo() did not appear, e.g. from 
kmeans.scala.


What are the steps to enable these logs?


Regards,

Liming
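
One thing to try (an assumption, not a verified fix): raise the logger for the
class in question explicitly in conf/log4j.properties, in case a parent logger
is being set to a higher level somewhere on the classpath, e.g.

log4j.logger.org.apache.spark.mllib.clustering.KMeans=INFO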


RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
Hi Bernhard,

Take a look at the examples shown under the "Pushing down clauses to Cassandra" 
sections on this page:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md


Mohammed
Author: Big Data Analytics with Spark

-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] 
Sent: Tuesday, February 9, 2016 10:05 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

Thanks for the hint, I should probably do that :)

As for the DF singleton:

/**
  * Lazily instantiated singleton instance of base_data DataFrame
  */
object base_data_df {

   @transient private var instance: DataFrame = _

   def getInstance(sqlContext: SQLContext): DataFrame = {
 if (instance == null) {
   // Load DataFrame with C* data-source
   instance = sqlContext.read
 .format("org.apache.spark.sql.cassandra")
 .options(Map("table" -> "cf", "keyspace" -> "ks"))
 .load()
 }
 instance
   }
}

Bernhard

Quoting Mohammed Guller :

> You may have better luck with this question on the Spark Cassandra 
> Connector mailing list.
>
>
>
> One quick question about this code from your email:
>
>// Load DataFrame from C* data-source
>
>val base_data = base_data_df.getInstance(sqlContext)
>
>
>
> What exactly is base_data_df and how are you creating it?
>
> Mohammed
> Author: Big Data Analytics with
> Spark
>
>
>
> -Original Message-
> From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
> Sent: Tuesday, February 9, 2016 6:58 AM
> To: user@spark.apache.org
> Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames
>
>
>
> All,
>
>
>
> I'm new to Spark and I'm having a hard time doing a simple join of two 
> DFs
>
>
>
> Intent:
>
> -  I'm receiving data from Kafka via direct stream and would like to 
> enrich the messages with data from Cassandra. The Kafka messages
>
> (Protobufs) are decoded into DataFrames and then joined with a 
> (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) 
> streaming batch size to raw C* data is [several streaming messages to 
> millions of C* rows], BUT the join always yields exactly ONE result 
> [1:1] per message. After the join the resulting DF is eventually 
> stored to another C* table.
>
>
>
> Problem:
>
> - Even though I'm joining the two DFs on the full Cassandra primary 
> key and pushing the corresponding filter to C*, it seems that Spark is 
> loading the whole C* data-set into memory before actually joining 
> (which I'd like to prevent by using the filter/predicate pushdown).
>
> This leads to a lot of shuffling and tasks being spawned, hence the 
> "simple" join takes forever...
>
>
>
> Could anyone shed some light on this? In my perception this should be 
> a prime-example for DFs and Spark Streaming.
>
>
>
> Environment:
>
> - Spark 1.6
>
> - Cassandra 2.1.12
>
> - Cassandra-Spark-Connector 1.5-RC1
>
> - Kafka 0.8.2.2
>
>
>
> Code:
>
>
>
> def main(args: Array[String]) {
>
>  val conf = new SparkConf()
>
>.setAppName("test")
>
>.set("spark.cassandra.connection.host", "xxx")
>
>.set("spark.cassandra.connection.keep_alive_ms", "3")
>
>.setMaster("local[*]")
>
>
>
>  val ssc = new StreamingContext(conf, Seconds(10))
>
>  ssc.sparkContext.setLogLevel("INFO")
>
>
>
>  // Initialise Kafka
>
>  val kafkaTopics = Set[String]("xxx")
>
>  val kafkaParams = Map[String, String](
>
>"metadata.broker.list" -> 
> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
>
>"auto.offset.reset" -> "smallest")
>
>
>
>  // Kafka stream
>
>  val messages = KafkaUtils.createDirectStream[String, MyMsg, 
> StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)
>
>
>
>  // Executed on the driver
>
>  messages.foreachRDD { rdd =>
>
>
>
>// Create an instance of SQLContext
>
>val sqlContext = 
> SQLContextSingleton.getInstance(rdd.sparkContext)
>
>import sqlContext.implicits._
>
>
>
>// Map MyMsg RDD
>
>val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}
>
>
>
>// Convert RDD[MyMsg] to DataFrame
>
>val MyMsgDf = MyMsgRdd.toDF()
>
> .select(
>
>  $"prim1Id" as 'prim1_id,
>
>  $"prim2Id" as 'prim2_id,
>
>  $...
>
>)
>
>
>
>// Load DataFrame from C* data-source
>
>val base_data = base_data_df.getInstance(sqlContext)
>
>
>
>// Inner join on prim1Id and prim2Id
>
>val joinedDf = MyMsgDf.join(base_data,
>
>  MyMsgDf("prim1_id") === base_data("prim1_id") &&
>
>  MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
>
>  .filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
>
>  && 

Re: AM creation in yarn-client mode

2016-02-09 Thread Alexander Pivovarov
Here are pictures to illustrate it:
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_running_spark_on_yarn.html

On Tue, Feb 9, 2016 at 10:18 PM, Jonathan Kelly 
wrote:

> In yarn-client mode, the driver is separate from the AM. The AM is created
> in YARN, and YARN controls where it goes (though you can somewhat control
> it using YARN node labels--I just learned earlier today in a different
> thread on this list that this can be controlled by
> spark.yarn.am.labelExpression). Then what I understand is that the driver
> talks to the AM in order to request additional YARN containers in which to
> run executors.
>
> In yarn-cluster mode, the SparkSubmit process outside of the cluster
> creates the AM in YARN, and then what I understand is that the AM *becomes*
> the driver (by invoking the driver's main method), and then it requests the
> executor containers.
>
> So yes, one difference between yarn-client and yarn-cluster mode is that
> in yarn-client mode the driver and AM are separate, whereas they are the
> same in yarn-cluster.
>
> ~ Jonathan
>
> On Tue, Feb 9, 2016 at 9:57 PM praveen S  wrote:
>
>> Can you explain what happens in yarn client mode?
>>
>> Regards,
>> Praveen
>> On 10 Feb 2016 10:55, "ayan guha"  wrote:
>>
>>> It depends on yarn-cluster and yarn-client mode.
>>>
>>> On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:
>>>
 Hi,

 I have 2 questions when running the spark jobs on yarn in client mode :

 1) Where is the AM(application master) created :

 A) is it created on the client where the job was submitted? i.e driver
 and AM on the same client?
 Or
 B) yarn decides where the the AM should be created?

 2) Driver and AM run in different processes : is my assumption correct?

 Regards,
 Praveen

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>


Re: AM creation in yarn client mode

2016-02-09 Thread Diwakar Dhanuskodi
Your 2nd assumption is correct.
There is a YARN client which polls the AM while running in yarn-client mode.

Sent from Samsung Mobile.

 Original message From: ayan guha 
 Date:10/02/2016  10:55  (GMT+05:30) 
To: praveen S  Cc: user 
 Subject: Re: AM creation in yarn client mode 

It depends on yarn-cluster and yarn-client mode. 

On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:
Hi,

I have 2 questions when running the spark jobs on yarn in client mode :

1) Where is the AM(application master) created :

A) is it created on the client where the job was submitted? i.e driver and AM 
on the same client? 
Or 
B) yarn decides where the the AM should be created?

2) Driver and AM run in different processes : is my assumption correct?

Regards, 
Praveen




-- 
Best Regards,
Ayan Guha


RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
Moving the spark mailing list to BCC since this is not really related to Spark.

Maybe I am missing something, but where are you calling the filter method on 
the base_data DF to push down the predicates to Cassandra before calling the 
join method? 

Mohammed
Author: Big Data Analytics with Spark


-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] 
Sent: Tuesday, February 9, 2016 10:47 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

I'm aware of that documentation; what are you hinting at specifically?
I'm pushing all elements of the partition key, so that should work. As user
zero323 on SO pointed out, the problem is most probably related to the
dynamic nature of the predicate elements (two distributed collections per
filter per join).

The statement "To push down partition keys, all of them must be included, but 
not more than one predicate per partition key, otherwise nothing is pushed 
down."

Does not apply IMO?

Bernhard

Quoting Mohammed Guller :

> Hi Bernhard,
>
> Take a look at the examples shown under the "Pushing down clauses to 
> Cassandra" sections on this page:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/
> 14_data_frames.md
>
>
> Mohammed
> Author: Big Data Analytics with Spark
>
> -Original Message-
> From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
> Sent: Tuesday, February 9, 2016 10:05 PM
> To: Mohammed Guller
> Cc: user@spark.apache.org
> Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames
>
> Hi Mohammed
>
> Thanks for hint, I should probably do that :)
>
> As for the DF singleton:
>
> /**
>   * Lazily instantiated singleton instance of base_data DataFrame
>   */
> object base_data_df {
>
>@transient private var instance: DataFrame = _
>
>def getInstance(sqlContext: SQLContext): DataFrame = {
>  if (instance == null) {
>// Load DataFrame with C* data-source
>instance = sqlContext.read
>  .format("org.apache.spark.sql.cassandra")
>  .options(Map("table" -> "cf", "keyspace" -> "ks"))
>  .load()
>  }
>  instance
>}
> }
>
> Bernhard
>
> Quoting Mohammed Guller :
>
>> You may have better luck with this question on the Spark Cassandra 
>> Connector mailing list.
>>
>>
>>
>> One quick question about this code from your email:
>>
>>// Load DataFrame from C* data-source
>>
>>val base_data = base_data_df.getInstance(sqlContext)
>>
>>
>>
>> What exactly is base_data_df and how are you creating it?
>>
>> Mohammed
>> Author: Big Data Analytics with
>> Spark
>>
>>
>>
>> -Original Message-
>> From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
>> Sent: Tuesday, February 9, 2016 6:58 AM
>> To: user@spark.apache.org
>> Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames
>>
>>
>>
>> All,
>>
>>
>>
>> I'm new to Spark and I'm having a hard time doing a simple join of 
>> two DFs
>>
>>
>>
>> Intent:
>>
>> -  I'm receiving data from Kafka via direct stream and would like to 
>> enrich the messages with data from Cassandra. The Kafka messages
>>
>> (Protobufs) are decoded into DataFrames and then joined with a 
>> (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) 
>> streaming batch size to raw C* data is [several streaming messages to 
>> millions of C* rows], BUT the join always yields exactly ONE result 
>> [1:1] per message. After the join the resulting DF is eventually 
>> stored to another C* table.
>>
>>
>>
>> Problem:
>>
>> - Even though I'm joining the two DFs on the full Cassandra primary 
>> key and pushing the corresponding filter to C*, it seems that Spark 
>> is loading the whole C* data-set into memory before actually joining 
>> (which I'd like to prevent by using the filter/predicate pushdown).
>>
>> This leads to a lot of shuffling and tasks being spawned, hence the 
>> "simple" join takes forever...
>>
>>
>>
>> Could anyone shed some light on this? In my perception this should be 
>> a prime-example for DFs and Spark Streaming.
>>
>>
>>
>> Environment:
>>
>> - Spark 1.6
>>
>> - Cassandra 2.1.12
>>
>> - Cassandra-Spark-Connector 1.5-RC1
>>
>> - Kafka 0.8.2.2
>>
>>
>>
>> Code:
>>
>>
>>
>> def main(args: Array[String]) {
>>
>>  val conf = new SparkConf()
>>
>>.setAppName("test")
>>
>>.set("spark.cassandra.connection.host", "xxx")
>>
>>.set("spark.cassandra.connection.keep_alive_ms", "3")
>>
>>.setMaster("local[*]")
>>
>>
>>
>>  val ssc = new StreamingContext(conf, Seconds(10))
>>
>>  ssc.sparkContext.setLogLevel("INFO")
>>
>>
>>
>>  // Initialise Kafka
>>
>>  val kafkaTopics = Set[String]("xxx")
>>
>>  val kafkaParams = Map[String, String](
>>
>>"metadata.broker.list" ->
>> 

Re: AM creation in yarn-client mode

2016-02-09 Thread praveen S
Can you explain what happens in yarn client mode?

Regards,
Praveen
On 10 Feb 2016 10:55, "ayan guha"  wrote:

> It depends on yarn-cluster and yarn-client mode.
>
> On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:
>
>> Hi,
>>
>> I have 2 questions when running the spark jobs on yarn in client mode :
>>
>> 1) Where is the AM(application master) created :
>>
>> A) is it created on the client where the job was submitted? i.e driver
>> and AM on the same client?
>> Or
>> B) yarn decides where the the AM should be created?
>>
>> 2) Driver and AM run in different processes : is my assumption correct?
>>
>> Regards,
>> Praveen
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard

Hi Mohammed

Thanks for the hint, I should probably do that :)

As for the DF singleton:

/**
 * Lazily instantiated singleton instance of base_data DataFrame
 */
object base_data_df {

  @transient private var instance: DataFrame = _

  def getInstance(sqlContext: SQLContext): DataFrame = {
if (instance == null) {
  // Load DataFrame with C* data-source
  instance = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "cf", "keyspace" -> "ks"))
.load()
}
instance
  }
}

Bernhard

Quoting Mohammed Guller :

You may have better luck with this question on the Spark Cassandra  
Connector mailing list.




One quick question about this code from your email:

   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



What exactly is base_data_df and how are you creating it?

Mohammed
Author: Big Data Analytics with  
Spark




-Original Message-
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch]
Sent: Tuesday, February 9, 2016 6:58 AM
To: user@spark.apache.org
Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames



All,



I'm new to Spark and I'm having a hard time doing a simple join of two DFs



Intent:

-  I'm receiving data from Kafka via direct stream and would like to  
enrich the messages with data from Cassandra. The Kafka messages


(Protobufs) are decoded into DataFrames and then joined with a  
(supposedly pre-filtered) DF from Cassandra. The relation of (Kafka)  
streaming batch size to raw C* data is [several streaming messages  
to millions of C* rows], BUT the join always yields exactly ONE  
result [1:1] per message. After the join the resulting DF is  
eventually stored to another C* table.




Problem:

- Even though I'm joining the two DFs on the full Cassandra primary  
key and pushing the corresponding filter to C*, it seems that Spark  
is loading the whole C* data-set into memory before actually joining  
(which I'd like to prevent by using the filter/predicate pushdown).


This leads to a lot of shuffling and tasks being spawned, hence the  
"simple" join takes forever...




Could anyone shed some light on this? In my perception this should  
be a prime-example for DFs and Spark Streaming.




Environment:

- Spark 1.6

- Cassandra 2.1.12

- Cassandra-Spark-Connector 1.5-RC1

- Kafka 0.8.2.2



Code:



def main(args: Array[String]) {

 val conf = new SparkConf()

   .setAppName("test")

   .set("spark.cassandra.connection.host", "xxx")

   .set("spark.cassandra.connection.keep_alive_ms", "3")

   .setMaster("local[*]")



 val ssc = new StreamingContext(conf, Seconds(10))

 ssc.sparkContext.setLogLevel("INFO")



 // Initialise Kafka

 val kafkaTopics = Set[String]("xxx")

 val kafkaParams = Map[String, String](

   "metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",

   "auto.offset.reset" -> "smallest")



 // Kafka stream

 val messages = KafkaUtils.createDirectStream[String, MyMsg,  
StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)




 // Executed on the driver

 messages.foreachRDD { rdd =>



   // Create an instance of SQLContext

   val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)

   import sqlContext.implicits._



   // Map MyMsg RDD

   val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}



   // Convert RDD[MyMsg] to DataFrame

   val MyMsgDf = MyMsgRdd.toDF()

.select(

 $"prim1Id" as 'prim1_id,

 $"prim2Id" as 'prim2_id,

 $...

   )



   // Load DataFrame from C* data-source

   val base_data = base_data_df.getInstance(sqlContext)



   // Inner join on prim1Id and prim2Id

   val joinedDf = MyMsgDf.join(base_data,

 MyMsgDf("prim1_id") === base_data("prim1_id") &&

 MyMsgDf("prim2_id") === base_data("prim2_id"), "left")

 .filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))

 && base_data("prim2_id").isin(MyMsgDf("prim2_id")))



   joinedDf.show()

   joinedDf.printSchema()



   // Select relevant fields



   // Persist



 }



 // Start the computation

 ssc.start()

 ssc.awaitTermination()

}



SO:

http://stackoverflow.com/questions/35295182/joining-kafka-and-cassandra-dataframes-in-spark-streaming-ignores-c-predicate-p







-

To unsubscribe, e-mail:  
user-unsubscr...@spark.apache.org  
For additional commands, e-mail:  
user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: AM creation in yarn-client mode

2016-02-09 Thread Jonathan Kelly
In yarn-client mode, the driver is separate from the AM. The AM is created
in YARN, and YARN controls where it goes (though you can somewhat control
it using YARN node labels--I just learned earlier today in a different
thread on this list that this can be controlled by
spark.yarn.am.labelExpression). Then what I understand is that the driver
talks to the AM in order to request additional YARN containers in which to
run executors.

In yarn-cluster mode, the SparkSubmit process outside of the cluster
creates the AM in YARN, and then what I understand is that the AM *becomes*
the driver (by invoking the driver's main method), and then it requests the
executor containers.

So yes, one difference between yarn-client and yarn-cluster mode is that in
yarn-client mode the driver and AM are separate, whereas they are the same
in yarn-cluster.

~ Jonathan
On Tue, Feb 9, 2016 at 9:57 PM praveen S  wrote:

> Can you explain what happens in yarn client mode?
>
> Regards,
> Praveen
> On 10 Feb 2016 10:55, "ayan guha"  wrote:
>
>> It depends on yarn-cluster and yarn-client mode.
>>
>> On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:
>>
>>> Hi,
>>>
>>> I have 2 questions when running the spark jobs on yarn in client mode :
>>>
>>> 1) Where is the AM(application master) created :
>>>
>>> A) is it created on the client where the job was submitted? i.e driver
>>> and AM on the same client?
>>> Or
>>> B) yarn decides where the the AM should be created?
>>>
>>> 2) Driver and AM run in different processes : is my assumption correct?
>>>
>>> Regards,
>>> Praveen
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
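
A short sketch of the two submit modes described above (the class and jar names
are placeholders):

# yarn-client: the driver runs in the spark-submit process; the AM is a separate container in YARN
spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar

# yarn-cluster: the AM container hosts the driver inside YARN
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar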


Re: how to send JavaDStream RDD using foreachRDD using Java

2016-02-09 Thread unk1102
Hi Sachin, how did you write to Kafka from Spark? I can't find the following
methods, sendString and sendDataAsString, in KafkaUtils. Can you please guide me?

KafkaUtil.sendString(p,topic,result.get(0)); 
 KafkaUtils.sendDataAsString(MTP,topicName, result.get(0)); 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-send-JavaDStream-RDD-using-foreachRDD-using-Java-tp21456p26183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
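
For reference, KafkaUtils in Spark only covers reading from Kafka; it has no
send methods. A minimal sketch (in Scala; the broker list, topic name and
dstream are hypothetical) of the usual pattern of writing out from foreachRDD
with a Kafka producer created per partition:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // one producer per partition, created on the executor
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    iter.foreach(msg => producer.send(new ProducerRecord[String, String]("myTopic", msg)))
    producer.close()
  }
}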



RE: HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Diwakar Dhanuskodi
Pass all the Hadoop conf files as spark-submit parameters in --files.


Sent from Samsung Mobile.

 Original message From: Rachana Srivastava 
 Date:09/02/2016  22:53  
(GMT+05:30) To: user@spark.apache.org Cc:  
Subject: HADOOP_HOME are not set when try to run spark application 
in yarn cluster mode 
I am trying to run an application in yarn cluster mode.
 
Spark-Submit with Yarn Cluster
Here are the settings of the shell script:
spark-submit --class "com.Myclass"  \
--num-executors 2 \
--executor-cores 2 \
--master yarn \
--supervise \
--deploy-mode cluster \
../target/ \
 
My application is working fine in yarn-client and local mode.
 
Excerpt of the error when we submit the application via spark-submit in
yarn-cluster mode:
 
&& HADOOP HOME correct path logged but still getting the 
error
/usr/lib/hadoop
&& HADOOP_CONF_DIR
/usr/lib/hadoop/etc/hadoop
...
Diagnostics: Exception from container-launch.
Container id: container_1454984479786_0006_02_01
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 
Further I am getting following error
ERROR DETAILS FROM YARN LOGS APPLICATIONID
INFO : org.apache.spark.deploy.yarn.ApplicationMaster - Registered signal 
handlers for [TERM, HUP, INT]
DEBUG: org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home 
directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:307)
at org.apache.hadoop.util.Shell.(Shell.java:332)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
at 
org.apache.hadoop.yarn.conf.YarnConfiguration.(YarnConfiguration.java:590)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:52)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:386)
at 
org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:384)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:384)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:401)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:623)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 
I tried modifying spark-env.sh as follows and I see HADOOP_HOME logged, but I am
still getting the above error.
I added the following entries to spark-env.sh:
export HADOOP_HOME="/usr/lib/hadoop"
echo "&& HADOOP HOME "
echo "$HADOOP_HOME"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
echo "&& HADOOP_CONF_DIR "
echo "$HADOOP_CONF_DIR"