Re: df.dtypes -> pyspark.sql.types
Spark 1.5 is the latest that I have access to and where this problem happens. I don't see that it's fixed in master, but I might be wrong. Diff attached. https://raw.githubusercontent.com/apache/spark/branch-1.5/python/pyspark/sql/types.py https://raw.githubusercontent.com/apache/spark/d57daf1f7732a7ac54a91fe112deeda0a254f9ef/python/pyspark/sql/types.py -- Ruslan Dautkhanov On Wed, Mar 16, 2016 at 4:44 PM, Reynold Xin wrote: > We probably should have the alias. Is this still a problem on master > branch? > > On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov > wrote: > >> Running following: >> >> #fix schema for gaid which should not be Double >>> from pyspark.sql.types import * >>> customSchema = StructType() >>> for (col,typ) in tsp_orig.dtypes: >>> if col=='Agility_GAID': >>> typ='string' >>> customSchema.add(col,typ,True) >> >> >> Getting >> >> ValueError: Could not parse datatype: bigint >> >> >> Looks like pyspark.sql.types doesn't know anything about bigint.. >> Should it be aliased to LongType in pyspark.sql.types? >> >> Thanks >> >> >> On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov > > wrote: >> >>> Hello, >>> >>> Looking at >>> >>> https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html >>> >>> and can't wrap my head around how to convert string data types names to >>> actual >>> pyspark.sql.types data types? >>> >>> Does pyspark.sql.types has an interface to return StringType() for >>> "string", >>> IntegerType() for "integer" etc? If it doesn't exist it would be great >>> to have such a >>> mapping function. >>> >>> Thank you. >>> >>> >>> ps. I have a data frame, and use its dtypes to loop through all columns >>> to fix a few >>> columns' data types as a workaround for SPARK-13866. >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >> >> >

683a684,806
> _FIXED_DECIMAL = re.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)")
>
>
> _BRACKETS = {'(': ')', '[': ']', '{': '}'}
>
>
> def _parse_basic_datatype_string(s):
>     if s in _all_atomic_types.keys():
>         return _all_atomic_types[s]()
>     elif s == "int":
>         return IntegerType()
>     elif _FIXED_DECIMAL.match(s):
>         m = _FIXED_DECIMAL.match(s)
>         return DecimalType(int(m.group(1)), int(m.group(2)))
>     else:
>         raise ValueError("Could not parse datatype: %s" % s)
>
>
> def _ignore_brackets_split(s, separator):
>     """
>     Splits the given string by given separator, but ignore separators inside
>     brackets pairs, e.g. given "a,b" and separator ",", it will return ["a", "b"],
>     but given "a<b,c>, d", it will return ["a<b,c>", "d"].
>     """
>     parts = []
>     buf = ""
>     level = 0
>     for c in s:
>         if c in _BRACKETS.keys():
>             level += 1
>             buf += c
>         elif c in _BRACKETS.values():
>             if level == 0:
>                 raise ValueError("Brackets are not correctly paired: %s" % s)
>             level -= 1
>             buf += c
>         elif c == separator and level > 0:
>             buf += c
>         elif c == separator:
>             parts.append(buf)
>             buf = ""
>         else:
>             buf += c
>
>     if len(buf) == 0:
>         raise ValueError("The %s cannot be the last char: %s" % (separator, s))
>     parts.append(buf)
>     return parts
>
>
> def _parse_struct_fields_string(s):
>     parts = _ignore_brackets_split(s, ",")
>     fields = []
>     for part in parts:
>         name_and_type = _ignore_brackets_split(part, ":")
>         if len(name_and_type) != 2:
>             raise ValueError("The struct field string format is: 'field_name:field_type', " +
>                              "but got: %s" % part)
>         field_name = name_and_type[0].strip()
>         field_type = _parse_datatype_string(name_and_type[1])
>         fields.append(StructField(field_name, field_type))
>     return StructType(fields)
>
>
> def _parse_datatype_string(s):
>     """
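Until that parser is available, a small lookup on the 1.5 side goes a long way for simple column types. Below is a minimal sketch (illustrative, not tested against the exact data in this thread) that aliases bigint to LongType by hand and reuses the tsp_orig / Agility_GAID names from the messages above:

```python
from pyspark.sql.types import (StructType, StringType, IntegerType, LongType,
                               DoubleType, FloatType, BooleanType,
                               TimestampType, DateType)

# df.dtypes strings -> pyspark.sql.types constructors;
# "bigint" is the alias missing from the 1.5 string parser.
SIMPLE_TYPES = {
    'string': StringType, 'int': IntegerType, 'integer': IntegerType,
    'bigint': LongType, 'long': LongType, 'double': DoubleType,
    'float': FloatType, 'boolean': BooleanType,
    'timestamp': TimestampType, 'date': DateType,
}

def dtype_to_spark_type(name):
    """Map a df.dtypes string such as 'bigint' to a DataType instance."""
    try:
        return SIMPLE_TYPES[name]()
    except KeyError:
        raise ValueError("Could not map datatype: %s" % name)

# Same loop as in the thread, but passing DataType objects instead of strings.
customSchema = StructType()
for col, typ in tsp_orig.dtypes:
    if col == 'Agility_GAID':
        typ = 'string'
    customSchema.add(col, dtype_to_spark_type(typ), True)
```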
Re: df.dtypes -> pyspark.sql.types
Running the following:

#fix schema for gaid which should not be Double
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col,typ) in tsp_orig.dtypes:
>     if col=='Agility_GAID':
>         typ='string'
>     customSchema.add(col,typ,True)

Getting

ValueError: Could not parse datatype: bigint

Looks like pyspark.sql.types doesn't know anything about bigint. Should it be aliased to LongType in pyspark.sql.types? Thanks

On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov wrote: > Hello, > > Looking at > > https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html > > and can't wrap my head around how to convert string data types names to > actual > pyspark.sql.types data types? > > Does pyspark.sql.types has an interface to return StringType() for > "string", > IntegerType() for "integer" etc? If it doesn't exist it would be great to > have such a > mapping function. > > Thank you. > > > ps. I have a data frame, and use its dtypes to loop through all columns to > fix a few > columns' data types as a workaround for SPARK-13866. > > > -- > Ruslan Dautkhanov >
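A workaround that avoids string parsing altogether is to rebuild the schema from the DataFrame's existing DataType objects and override only the offending column. A minimal sketch, assuming the same tsp_orig DataFrame and Agility_GAID column as above:

```python
from pyspark.sql.types import StructType, StructField, StringType

# df.schema already holds DataType objects, so nothing needs to be parsed;
# only the one column is swapped out for StringType.
fields = [StructField(f.name, StringType(), True) if f.name == 'Agility_GAID'
          else f
          for f in tsp_orig.schema.fields]
customSchema = StructType(fields)
```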
df.dtypes -> pyspark.sql.types
Hello, Looking at https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html and can't wrap my head around how to convert string data type names to actual pyspark.sql.types data types. Does pyspark.sql.types have an interface that returns StringType() for "string", IntegerType() for "integer", etc.? If it doesn't exist, it would be great to have such a mapping function. Thank you. ps. I have a data frame, and use its dtypes to loop through all columns to fix a few columns' data types as a workaround for SPARK-13866. -- Ruslan Dautkhanov
Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can't be found
Spark session dies out after ~40 hours when running against a secure Hadoop cluster. spark-submit has --principal and --keytab, so kerberos ticket renewal works fine according to the logs. Something happens with the HDFS connection? These messages come up every second (see the complete stack: http://pastebin.com/QxcQvpqm):

16/03/11 16:04:59 WARN hdfs.LeaseRenewer: Failed to renew lease for
> [DFSClient_NONMAPREDUCE_1534318438_13] for 2802 seconds. Will retry shortly ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in cache

Then in 1 hour it stops trying:

16/03/11 16:18:17 WARN hdfs.DFSClient: Failed to renew lease for
> DFSClient_NONMAPREDUCE_1534318438_13 for 3600 seconds (>= hard-limit =3600 seconds.)
> Closing all files being written ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in cache

It doesn't look like a Kerberos principal ticket renewal problem, because that would expire much sooner (by default we have 12 hours), and from the logs the Spark kerberos ticket renewer works fine. Is it some other HDFS delegation token renewal process that breaks?

RHEL 6.7
> Spark 1.5
> Hadoop 2.6

Found HDFS-5322, YARN-2648 that seem relevant, but I am not sure if it's the same problem. It seems to be a Spark problem, as I have only seen it in Spark. This is a reproducible problem: just wait for ~40 hours and the Spark session is no good. Thanks, Ruslan
binary file deserialization
We have a huge binary file in a custom serialization format (e.g. a header tells the length of the record, then there is a varying number of items for that record). This is produced by an old C++ application. What would be the best approach to deserialize it into a Hive table or a Spark RDD? The format is known and well documented. -- Ruslan Dautkhanov
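One common approach is to read the raw bytes with sc.binaryFiles and parse records with a generator. The sketch below assumes a made-up layout (a 4-byte little-endian length header followed by the payload) and an illustrative path; the real header format from the C++ application would go in parse_records. It only scales as far as each individual file fits in an executor's memory; for a single enormous file, a custom Hadoop InputFormat read via sc.newAPIHadoopFile is the more robust route.

```python
import struct

def parse_records(path_and_bytes):
    """Yield one payload per record from a whole-file byte string."""
    _, data = path_and_bytes
    pos, size = 0, len(data)
    while pos + 4 <= size:
        (rec_len,) = struct.unpack_from('<i', data, pos)  # hypothetical header layout
        pos += 4
        yield data[pos:pos + rec_len]
        pos += rec_len

# sc.binaryFiles yields (path, contents) pairs, one per file
records = sc.binaryFiles("/data/legacy_binary/*.bin").flatMap(parse_records)
```

From there the parsed records can be mapped to Rows and saved/registered as a Hive table with sqlContext.createDataFrame(...).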
Re: Spark + Sentry + Kerberos don't add up?
Turns to be it is a Spark issue https://issues.apache.org/jira/browse/SPARK-13478 -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov wrote: > Hi Romain, > > Thank you for your response. > > Adding Kerberos support might be as simple as > https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal > and --keytab parameters to be passed to spark-submit. > > As a workaround I just did kinit (using hues' keytab) and then launched > Livy Server. It probably will work as long as kerberos ticket doesn't > expire. That's it would be great to have support for --principal and > --keytab parameters for spark-submit as explined in > http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html > > > The only problem I have currently is the above error stack in my previous > email: > > The Spark session could not be created in the cluster: >> at org.apache.hadoop.security.*UserGroupInformation.doAs*( >> UserGroupInformation.java:1671) >> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1( >> SparkSubmit.scala:160) > > > > >> AFAIK Hive impersonation should be turned off when using Sentry > > Yep, exactly. That's what I did. It is disabled now. But looks like on > other hand, Spark or Spark Notebook want to have that enabled? > It tries to do org.apache.hadoop.security.UserGroupInformation.doAs() > hence the error. > > So Sentry isn't compatible with Spark in kerberized clusters? Is any > workaround for this problem? > > > -- > Ruslan Dautkhanov > > On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux > wrote: > >> Livy does not support any Kerberos yet >> https://issues.cloudera.org/browse/LIVY-3 >> >> Are you focusing instead about HS2 + Kerberos with Sentry? >> >> AFAIK Hive impersonation should be turned off when using Sentry: >> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html >> >> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > > wrote: >> >>> Getting following error stack >>> >>> The Spark session could not be created in the cluster: >>>> at org.apache.hadoop.security.*UserGroupInformation.doAs* >>>> (UserGroupInformation.java:1671) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ) >>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient* >>>> .open(HiveMetaStoreClient.java:466) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) >>>> at >>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>>> ... 35 more >>> >>> >>> My understanding that hive.server2.enable.impersonation and >>> hive.server2.enable.doAs should be enabled to make >>> UserGroupInformation.doAs() work? >>> >>> When I try to enable these parameters, Cloudera Manager shows error >>> >>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 >>>> (hostname)'. >>>> Hive Impersonation should be disabled to enable Hive authorization >>>> using Sentry >>> >>> >>> So Spark-Hive conflicts with Sentry!? >>> >>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). >>> CDH 5.5. >>> >>> This is a kerberized cluster with Sentry. 
>>> >>> I was using hue's keytab as hue user is normally (by default in CDH) is >>> allowed to impersonate to other users. >>> So very convenient for Spark Notebooks. >>> >>> Any information to help solve this will be highly appreciated. >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Hue-Users" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to hue-user+unsubscr...@cloudera.org. >>> >> >> >
spark.storage.memoryFraction for shuffle-only jobs
For a Spark job that only does shuffling (e.g. Spark SQL with joins, group bys, analytical functions, order bys), but has no explicitly persisted RDDs or DataFrames (there are no .cache() calls in the code), what would be the lowest recommended setting for spark.storage.memoryFraction? spark.storage.memoryFraction defaults to 0.6, which is quite large for shuffle-only jobs. spark.shuffle.memoryFraction defaults to 0.2 in Spark 1.5.0. Can I set spark.storage.memoryFraction to something low like 0.1 or even lower, and spark.shuffle.memoryFraction to something large like 0.9, or perhaps even 1.0? Thanks!
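For context, these two knobs belong to Spark 1.5's static memory model and are set as ordinary Spark conf properties. The values below only illustrate shifting memory from the (unused) storage pool to the shuffle pool; they are not a tuning recommendation:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-heavy-job")
        # shrink the cache pool since nothing is persisted ...
        .set("spark.storage.memoryFraction", "0.1")
        # ... and give the shuffle pool most of the heap
        .set("spark.shuffle.memoryFraction", "0.7"))
sc = SparkContext(conf=conf)
```

Note that the two fractions (plus their safety factors) share the same heap with user code and task objects, so pushing their sum toward 1.0 leaves little room for everything else.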
Re: Hive on Spark knobs
Yep, I tried that. It seems you're right. Got an error that execution engine has to be set to mr. hive.execution.engine = mr I did not keep exact error message/stack. It's probably disabled explicitly. -- Ruslan Dautkhanov On Thu, Jan 28, 2016 at 7:03 AM, Todd wrote: > Did you run hive on spark with spark 1.5 and hive 1.1? > I think hive on spark doesn't support spark 1.5. There are compatibility > issues. > > > At 2016-01-28 01:51:43, "Ruslan Dautkhanov" wrote: > > > https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started > > There are quite a lot of knobs to tune for Hive on Spark. > > Above page recommends following settings: > > mapreduce.input.fileinputformat.split.maxsize=75000 >> hive.vectorized.execution.enabled=true >> hive.cbo.enable=true >> hive.optimize.reducededuplication.min.reducer=4 >> hive.optimize.reducededuplication=true >> hive.orc.splits.include.file.footer=false >> hive.merge.mapfiles=true >> hive.merge.sparkfiles=false >> hive.merge.smallfiles.avgsize=1600 >> hive.merge.size.per.task=25600 >> hive.merge.orcfile.stripe.level=true >> hive.auto.convert.join=true >> hive.auto.convert.join.noconditionaltask=true >> hive.auto.convert.join.noconditionaltask.size=894435328 >> hive.optimize.bucketmapjoin.sortedmerge=false >> hive.map.aggr.hash.percentmemory=0.5 >> hive.map.aggr=true >> hive.optimize.sort.dynamic.partition=false >> hive.stats.autogather=true >> hive.stats.fetch.column.stats=true >> hive.vectorized.execution.reduce.enabled=false >> hive.vectorized.groupby.checkinterval=4096 >> hive.vectorized.groupby.flush.percent=0.1 >> hive.compute.query.using.stats=true >> hive.limit.pushdown.memory.usage=0.4 >> hive.optimize.index.filter=true >> hive.exec.reducers.bytes.per.reducer=67108864 >> hive.smbjoin.cache.rows=1 >> hive.exec.orc.default.stripe.size=67108864 >> hive.fetch.task.conversion=more >> hive.fetch.task.conversion.threshold=1073741824 >> hive.fetch.task.aggr=false >> mapreduce.input.fileinputformat.list-status.num-threads=5 >> spark.kryo.referenceTracking=false >> >> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch > > > Did it work for everybody? It may take days if not weeks to try to tune > all of these parameters for a specific job. > > We're on Spark 1.5 / Hive 1.1. > > > ps. We have a job that can't get working well as a Hive job, so thought to > use Hive on Spark instead. (a 3-table full outer joins with group by + > collect_list). Spark should handle this much better. > > > Ruslan > > >
Hive on Spark knobs
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started There are quite a lot of knobs to tune for Hive on Spark. Above page recommends following settings: mapreduce.input.fileinputformat.split.maxsize=75000 > hive.vectorized.execution.enabled=true > hive.cbo.enable=true > hive.optimize.reducededuplication.min.reducer=4 > hive.optimize.reducededuplication=true > hive.orc.splits.include.file.footer=false > hive.merge.mapfiles=true > hive.merge.sparkfiles=false > hive.merge.smallfiles.avgsize=1600 > hive.merge.size.per.task=25600 > hive.merge.orcfile.stripe.level=true > hive.auto.convert.join=true > hive.auto.convert.join.noconditionaltask=true > hive.auto.convert.join.noconditionaltask.size=894435328 > hive.optimize.bucketmapjoin.sortedmerge=false > hive.map.aggr.hash.percentmemory=0.5 > hive.map.aggr=true > hive.optimize.sort.dynamic.partition=false > hive.stats.autogather=true > hive.stats.fetch.column.stats=true > hive.vectorized.execution.reduce.enabled=false > hive.vectorized.groupby.checkinterval=4096 > hive.vectorized.groupby.flush.percent=0.1 > hive.compute.query.using.stats=true > hive.limit.pushdown.memory.usage=0.4 > hive.optimize.index.filter=true > hive.exec.reducers.bytes.per.reducer=67108864 > hive.smbjoin.cache.rows=1 > hive.exec.orc.default.stripe.size=67108864 > hive.fetch.task.conversion=more > hive.fetch.task.conversion.threshold=1073741824 > hive.fetch.task.aggr=false > mapreduce.input.fileinputformat.list-status.num-threads=5 > spark.kryo.referenceTracking=false > > spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch Did it work for everybody? It may take days if not weeks to try to tune all of these parameters for a specific job. We're on Spark 1.5 / Hive 1.1. ps. We have a job that can't get working well as a Hive job, so thought to use Hive on Spark instead. (a 3-table full outer joins with group by + collect_list). Spark should handle this much better. Ruslan
Re: Spark + Sentry + Kerberos don't add up?
I took liberty and created a JIRA https://github.com/cloudera/livy/issues/36 Feel free to close it if doesn't belong to Livy project. I really don't know if this is a Spark or a Livy/Sentry problem. Any ideas for possible workarounds? Thank you. -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov wrote: > Hi Romain, > > Thank you for your response. > > Adding Kerberos support might be as simple as > https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal > and --keytab parameters to be passed to spark-submit. > > As a workaround I just did kinit (using hues' keytab) and then launched > Livy Server. It probably will work as long as kerberos ticket doesn't > expire. That's it would be great to have support for --principal and > --keytab parameters for spark-submit as explined in > http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html > > > The only problem I have currently is the above error stack in my previous > email: > > The Spark session could not be created in the cluster: >> at org.apache.hadoop.security.*UserGroupInformation.doAs*( >> UserGroupInformation.java:1671) >> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1( >> SparkSubmit.scala:160) > > > > >> AFAIK Hive impersonation should be turned off when using Sentry > > Yep, exactly. That's what I did. It is disabled now. But looks like on > other hand, Spark or Spark Notebook want to have that enabled? > It tries to do org.apache.hadoop.security.UserGroupInformation.doAs() > hence the error. > > So Sentry isn't compatible with Spark in kerberized clusters? Is any > workaround for this problem? > > > -- > Ruslan Dautkhanov > > On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux > wrote: > >> Livy does not support any Kerberos yet >> https://issues.cloudera.org/browse/LIVY-3 >> >> Are you focusing instead about HS2 + Kerberos with Sentry? >> >> AFAIK Hive impersonation should be turned off when using Sentry: >> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html >> >> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > > wrote: >> >>> Getting following error stack >>> >>> The Spark session could not be created in the cluster: >>>> at org.apache.hadoop.security.*UserGroupInformation.doAs* >>>> (UserGroupInformation.java:1671) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ) >>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient* >>>> .open(HiveMetaStoreClient.java:466) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) >>>> at >>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>>> ... 35 more >>> >>> >>> My understanding that hive.server2.enable.impersonation and >>> hive.server2.enable.doAs should be enabled to make >>> UserGroupInformation.doAs() work? >>> >>> When I try to enable these parameters, Cloudera Manager shows error >>> >>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 >>>> (hostname)'. >>>> Hive Impersonation should be disabled to enable Hive authorization >>>> using Sentry >>> >>> >>> So Spark-Hive conflicts with Sentry!? 
>>> >>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). >>> CDH 5.5. >>> >>> This is a kerberized cluster with Sentry. >>> >>> I was using hue's keytab as hue user is normally (by default in CDH) is >>> allowed to impersonate to other users. >>> So very convenient for Spark Notebooks. >>> >>> Any information to help solve this will be highly appreciated. >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Hue-Users" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to hue-user+unsubscr...@cloudera.org. >>> >> >> >
Re: Spark + Sentry + Kerberos don't add up?
Hi Romain, Thank you for your response.

Adding Kerberos support might be as simple as https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal and --keytab parameters to be passed to spark-submit.

As a workaround I just did kinit (using hue's keytab) and then launched Livy Server. It probably will work as long as the kerberos ticket doesn't expire. That's why it would be great to have support for --principal and --keytab parameters for spark-submit, as explained in http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html

The only problem I have currently is the above error stack in my previous email:

The Spark session could not be created in the cluster: > at org.apache.hadoop.security.*UserGroupInformation.doAs*( > UserGroupInformation.java:1671) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1( > SparkSubmit.scala:160)

>> AFAIK Hive impersonation should be turned off when using Sentry

Yep, exactly. That's what I did. It is disabled now. But it looks like, on the other hand, Spark or Spark Notebook wants to have that enabled? It tries to do org.apache.hadoop.security.UserGroupInformation.doAs(), hence the error. So Sentry isn't compatible with Spark in kerberized clusters? Is there any workaround for this problem? -- Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux wrote: > Livy does not support any Kerberos yet > https://issues.cloudera.org/browse/LIVY-3 > > Are you focusing instead about HS2 + Kerberos with Sentry? > > AFAIK Hive impersonation should be turned off when using Sentry: > http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html > > On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > wrote: > >> Getting following error stack >> >> The Spark session could not be created in the cluster: >>> at org.apache.hadoop.security.*UserGroupInformation.doAs* >>> (UserGroupInformation.java:1671) >>> at >>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) >>> at >>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ) >>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient* >>> .open(HiveMetaStoreClient.java:466) >>> at >>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) >>> at >>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>> ... 35 more >> >> >> My understanding that hive.server2.enable.impersonation and >> hive.server2.enable.doAs should be enabled to make >> UserGroupInformation.doAs() work? >> >> When I try to enable these parameters, Cloudera Manager shows error >> >> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 >>> (hostname)'. >>> Hive Impersonation should be disabled to enable Hive authorization using >>> Sentry >> >> >> So Spark-Hive conflicts with Sentry!? >> >> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). >> CDH 5.5. >> >> This is a kerberized cluster with Sentry. >> >> I was using hue's keytab as hue user is normally (by default in CDH) is >> allowed to impersonate to other users. >> So very convenient for Spark Notebooks. >> >> Any information to help solve this will be highly appreciated. >> >> >> -- >> Ruslan Dautkhanov >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Hue-Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an >> email to hue-user+unsubscr...@cloudera.org. >> > >
Spark + Sentry + Kerberos don't add up?
Getting the following error stack:

The Spark session could not be created in the cluster:
> at org.apache.hadoop.security.*UserGroupInformation.doAs*(UserGroupInformation.java:1671)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*.open(HiveMetaStoreClient.java:466)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
> at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> ... 35 more

My understanding is that hive.server2.enable.impersonation and hive.server2.enable.doAs should be enabled to make UserGroupInformation.doAs() work? When I try to enable these parameters, Cloudera Manager shows this error:

Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 (hostname)'.
> Hive Impersonation should be disabled to enable Hive authorization using Sentry

So Spark-Hive conflicts with Sentry!?

Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). CDH 5.5. This is a kerberized cluster with Sentry.

I was using hue's keytab, as the hue user is normally (by default in CDH) allowed to impersonate other users. So it is very convenient for Spark Notebooks.

Any information to help solve this will be highly appreciated. -- Ruslan Dautkhanov
livy test problem: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project livy-spark_2.10: There are test failures
Livy build test from master fails with below problem. Can't track it down. YARN shows Livy Spark yarn application as running. Although attempt to connect to application master shows connection refused: HTTP ERROR 500 > Problem accessing /proxy/application_1448640910222_0046/. Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) Not sure if Livy server has application master UI? CDH 5.5.1. Below mvn test output footer: > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] livy-main .. SUCCESS [ > 1.299 s] > [INFO] livy-api_2.10 .. SUCCESS [ > 3.622 s] > [INFO] livy-client-common_2.10 SUCCESS [ > 0.862 s] > [INFO] livy-client-local_2.10 . SUCCESS [ > 23.866 s] > [INFO] livy-core_2.10 . SUCCESS [ > 0.316 s] > [INFO] livy-repl_2.10 . SUCCESS [01:00 > min] > [INFO] livy-yarn_2.10 . SUCCESS [ > 0.215 s] > [INFO] livy-spark_2.10 FAILURE [ > 17.382 s] > [INFO] livy-server_2.10 ... SKIPPED > [INFO] livy-assembly_2.10 . SKIPPED > [INFO] livy-client-http_2.10 .. SKIPPED > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 01:48 min > [INFO] Finished at: 2016-01-14T14:34:28-07:00 > [INFO] Final Memory: 27M/453M > [INFO] > > [ERROR] Failed to execute goal > org.scalatest:scalatest-maven-plugin:1.0:test (test) on project > livy-spark_2.10: There are test failures -> [Help 1] > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute > goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project > livy-spark_2.10: There are test failures > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80) > at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51) > at > org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193) > at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106) > at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863) > at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288) > at org.apache.maven.cli.MavenCli.main(MavenCli.java:199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) > at > org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) > at > 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) > at > org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) > Caused by: org.apache.maven.plugin.MojoFailureException: There are test > failures > at org.scalatest.tools.maven.TestMojo.execute(TestMojo.java:107) > at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207) > ... 20 more > [ERROR] > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
Re: Spark on hbase using Phoenix in secure cluster
Try Phoenix from Cloudera parcel distribution https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/ They may have better Kerberos support .. On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Yes, its a kerberized cluster and ticket was generated using kinit command > before running spark job. That's why Spark on hbase worked but when phoenix > is used to get the connection to hbase, it does not pass the authentication > to all nodes. Probably it is not handled in Phoenix version 4.3 or Spark > 1.3.1 does not provide integration with Phoenix for kerberized cluster. > > Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured > cluster or not? > > Thanks, > Akhilesh > > On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov > wrote: > >> That error is not directly related to spark nor hbase >> >> javax.security.sasl.SaslException: GSS initiate failed [Caused by >> GSSException: No valid credentials provided (Mechanism level: Failed to >> find any Kerberos tgt)] >> >> Is this a kerberized cluster? You likely don't have a good (non-expired) >> kerberos ticket for authentication to pass. >> >> >> -- >> Ruslan Dautkhanov >> >> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia < >> pathodia.akhil...@gmail.com> wrote: >> >>> Hi, >>> >>> I am running spark job on yarn in cluster mode in secured cluster. I am >>> trying to run Spark on Hbase using Phoenix, but Spark executors are >>> unable to get hbase connection using phoenix. I am running knit command to >>> get the ticket before starting the job and also keytab file and principal >>> are correctly specified in connection URL. But still spark job on each node >>> throws below error: >>> >>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication >>> failed. The most likely cause is missing or invalid credentials. Consider >>> 'kinit'. >>> javax.security.sasl.SaslException: GSS initiate failed [Caused by >>> GSSException: No valid credentials provided (Mechanism level: Failed to >>> find any Kerberos tgt)] >>> at >>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) >>> >>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark >>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in >>> this link: >>> >>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos >>> >>> Also, I found that there is a known issue for yarn-cluster mode for >>> Spark 1.3.1 version: >>> >>> https://issues.apache.org/jira/browse/SPARK-6918 >>> >>> Has anybody been successful in running Spark on hbase using Phoenix in >>> yarn cluster or client mode? >>> >>> Thanks, >>> Akhilesh Pathodia >>> >> >> >
Re: Spark on hbase using Phoenix in secure cluster
That error is not directly related to spark nor hbase javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Is this a kerberized cluster? You likely don't have a good (non-expired) kerberos ticket for authentication to pass. -- Ruslan Dautkhanov On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Hi, > > I am running spark job on yarn in cluster mode in secured cluster. I am > trying to run Spark on Hbase using Phoenix, but Spark executors are > unable to get hbase connection using phoenix. I am running knit command to > get the ticket before starting the job and also keytab file and principal > are correctly specified in connection URL. But still spark job on each node > throws below error: > > 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed. > The most likely cause is missing or invalid credentials. Consider 'kinit'. > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to > find any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) > > I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark > on Hbase(without phoenix) successfully in yarn-client mode as mentioned in > this link: > > https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos > > Also, I found that there is a known issue for yarn-cluster mode for Spark > 1.3.1 version: > > https://issues.apache.org/jira/browse/SPARK-6918 > > Has anybody been successful in running Spark on hbase using Phoenix in > yarn cluster or client mode? > > Thanks, > Akhilesh Pathodia >
Re: SparkSQL AVRO
How many reducers you had that created those avro files? Each reducer very likely creates its own avro part- file. We normally use Parquet, but it should be the same for Avro, so this might be relevant http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289 -- Ruslan Dautkhanov On Mon, Dec 7, 2015 at 11:27 AM, Test One wrote: > I'm using spark-avro with SparkSQL to process and output avro files. My > data has the following schema: > > root > |-- memberUuid: string (nullable = true) > |-- communityUuid: string (nullable = true) > |-- email: string (nullable = true) > |-- firstName: string (nullable = true) > |-- lastName: string (nullable = true) > |-- username: string (nullable = true) > |-- profiles: map (nullable = true) > ||-- key: string > ||-- value: string (valueContainsNull = true) > > > When I write the file output as such with: > originalDF.write.avro("masterNew.avro") > > The output location is a folder with masterNew.avro and with many many > files like these: > -rw-r--r-- 1 kcsham access_bpf 8 Dec 2 11:37 ._SUCCESS.crc > -rw-r--r-- 1 kcsham access_bpf44 Dec 2 11:37 > .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc > -rw-r--r-- 1 kcsham access_bpf44 Dec 2 11:37 > .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc > -rw-r--r-- 1 kcsham access_bpf44 Dec 2 11:37 > .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc > -rw-r--r-- 1 kcsham access_bpf 0 Dec 2 11:37 _SUCCESS > -rw-r--r-- 1 kcsham access_bpf 4261 Dec 2 11:37 > part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro > -rw-r--r-- 1 kcsham access_bpf 4261 Dec 2 11:37 > part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro > -rw-r--r-- 1 kcsham access_bpf 4261 Dec 2 11:37 > part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro > > > Where there are ~10 record, it has ~28000 files in that folder. When I > simply want to copy the same dataset to a new location as an exercise from > a local master, it takes long long time and having errors like such as > well. > > 22:01:44.247 [Executor task launch worker-21] WARN > org.apache.spark.storage.MemoryStore - Not enough space to cache > rdd_112058_10705 in memory! (computed 496.0 B so far) > 22:01:44.247 [Executor task launch worker-21] WARN > org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to > disk instead. > [Stage 0:===> (10706 + 1) / > 28014]22:01:44.574 [Executor task launch worker-21] WARN > org.apache.spark.storage.MemoryStore - Failed to reserve initial memory > threshold of 1024.0 KB for computing block rdd_112058_10706 in memory. > > > I'm attributing that there are way too many files to manipulate. The > questions: > > 1. Is this the expected format of the avro file written by spark-avro? and > each 'part-' is not more than 4k? > 2. My use case is to append new records to the existing dataset using: > originalDF.unionAll(stageDF).write.avro(masterNew) > Any sqlconf, sparkconf that I should set to allow this to work? > > > Thanks, > kc > > >
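If the goal is simply fewer, larger Avro files per write, coalescing right before the write is the usual fix. A rough PySpark sketch (the partition count of 16 is arbitrary, and spark-avro is addressed through its format name); note that coalesce also lowers the parallelism of the final write stage:

```python
# shrink the number of partitions (and hence part- files) just before writing
(originalDF
    .coalesce(16)
    .write
    .format("com.databricks.spark.avro")
    .save("masterNew.avro"))
```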
Re: question about combining small parquet files
An interesting compaction approach of small files is discussed recently http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ AFAIK Spark supports views too. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi < nyigitb...@netflix.com.invalid> wrote: > Hi Spark people, > I have a Hive table that has a lot of small parquet files and I am > creating a data frame out of it to do some processing, but since I have a > large number of splits/files my job creates a lot of tasks, which I don't > want. Basically what I want is the same functionality that Hive provides, > that is, to combine these small input splits into larger ones by specifying > a max split size setting. Is this currently possible with Spark? > > I look at coalesce() but with coalesce I can only control the number > of output files not their sizes. And since the total input dataset size > can vary significantly in my case, I cannot just use a fixed partition > count as the size of each output file can get very large. I then looked for > getting the total input size from an rdd to come up with some heuristic to > set the partition count, but I couldn't find any ways to do it (without > modifying the spark source). > > Any help is appreciated. > > Thanks, > Nezih > > PS: this email is the same as my previous email as I learned that my > previous email ended up as spam for many people since I sent it through > nabble, sorry for the double post. >
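Along the same lines, a periodic compaction pass can be done in Spark itself by reading the small-file table and rewriting it with fewer partitions; the paths and partition count below are only placeholders:

```python
# Sketch of an offline compaction job: read the fragmented table and write it
# back out as fewer, larger files (ideally sized near the HDFS block size).
df = sqlContext.read.parquet("/warehouse/events_small_files")
df.coalesce(64).write.parquet("/warehouse/events_compacted")
```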
Re: Spark REST Job server feedback?
Very good question. From http://gethue.com/new-notebook-application-for-spark-sql/ "Use Livy Spark Job Server from the Hue master repository instead of CDH (it is currently much more advanced): see build & start the latest Livy <https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server> " Although that post is from April 2015, not sure if it's still accurate. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar wrote: > Hi > > I had the same question. Anyone having used Livy and/opr SparkJobServer, > would like their input. > > Regards > Deenar > > > *Think Reactive Ltd* > deenar.toras...@thinkreactive.co.uk > 07714140812 > > > > > On 8 October 2015 at 20:39, Tim Smith wrote: > >> I am curious too - any comparison between the two. Looks like one is >> Datastax sponsored and the other is Cloudera. Other than that, any >> major/core differences in design/approach? >> >> Thanks, >> >> Tim >> >> >> On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal < >> ramirezquetza...@gmail.com> wrote: >> >>> Anyone has feedback on using Hue / Spark Job Server REST servers? >>> >>> >>> http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/ >>> >>> https://github.com/spark-jobserver/spark-jobserver >>> >>> Many thanks, >>> >>> Rami >>> >> >> >> >> -- >> >> -- >> Thanks, >> >> Tim >> > >
Re: Data in one partition after reduceByKey
public long getTime() Returns the number of milliseconds since January 1, 1970, 00:00:00 GMT represented by this Date object. http://docs.oracle.com/javase/7/docs/api/java/util/Date.html#getTime%28%29 Based on what you did i might be easier to get date partitioner from that. Also, to get even more even distriubution you could use a hash function from that not just a remainder. -- Ruslan Dautkhanov On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin wrote: > I will answer my own question, since I figured it out. Here is my answer > in case anyone else has the same issue. > > My DateTimes were all without seconds and milliseconds since I wanted to > group data belonging to the same minute. The hashCode() for Joda DateTimes > which are one minute apart is a constant: > > scala> val now = DateTime.now > now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z > > scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - > now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode > res42: Int = 6 > > As can be seen by this example, if the hashCode values are similarly > spaced, they can end up in the same partition: > > scala> val nums = for(i <- 0 to 100) yield ((i*20 % 1000), i) > nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), > (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), > (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), > (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), > (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), > (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), > (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), > (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), > (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), > (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), > (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500... 
> > scala> val rddNum = sc.parallelize(nums) > rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at > parallelize at :23 > > scala> val reducedNum = rddNum.reduceByKey(_+_) > reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at > reduceByKey at :25 > > scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, > true).collect.toList > > res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0) > > To distribute my data more evenly across the partitions I created my own > custom Partitoiner: > > class JodaPartitioner(rddNumPartitions: Int) extends Partitioner { > def numPartitions: Int = rddNumPartitions > def getPartition(key: Any): Int = { > key match { > case dateTime: DateTime => > val sum = dateTime.getYear + dateTime.getMonthOfYear + > dateTime.getDayOfMonth + dateTime.getMinuteOfDay + dateTime.getSecondOfDay > sum % numPartitions > case _ => 0 > } > } > } > > > On 20 November 2015 at 17:17, Patrick McGloin > wrote: > >> Hi, >> >> I have Spark application which contains the following segment: >> >> val reparitioned = rdd.repartition(16) >> val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned, startDate, >> endDate) >> val mapped: RDD[(DateTime, myData)] = filtered.map(kv=(kv._1.processingTime, >> kv._2)) >> val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_) >> >> When I run this with some logging this is what I see: >> >> reparitioned ==> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, >> 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)] >> filtered ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, >> 2066, 2032, 2001, 2031, 2101, 2050, 2068)] >> mapped ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, >> 2066, 2032, 2001, 2031, 2101, 2050, 2068)] >> reduced ==> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)] >> >> My logging is done using these two lines: >> >> val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, >> true)log.info(s"rdd ==> [${sizes.collect.toList}]") >> >> My question is why does my data end up in one partition after the >> reduceByKey? After the filter it can be seen that the data is evenly >> distributed, but the reduceByKey results in data in only one partition. >> >> Thanks, >> >> Patrick >> > >
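A PySpark sketch of the suggestion above (derive the shuffle key, and therefore its hash, from the timestamp itself): here mapped stands in for an RDD of (datetime, value) pairs as in the thread, and operator.add stands in for the real combine function.

```python
from operator import add

# Key by a minute-truncated string: its hash spreads well across partitions,
# unlike near-identical objects whose hash codes differ by a small constant.
keyed = mapped.map(lambda kv: (kv[0].strftime("%Y-%m-%d %H:%M"), kv[1]))
reduced = keyed.reduceByKey(add, numPartitions=16)
```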
Re: ISDATE Function
You could write your own UDF isdate(). -- Ruslan Dautkhanov On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani wrote: > Hi Ted Yu, > > Thanks for your response. Is any other way to achieve in Spark Query? > > > Regards, > Ravi > > On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu wrote: > >> ISDATE() is currently not supported. >> Since it is SQL Server specific, I guess it wouldn't be added to Spark. >> >> On Mon, Nov 16, 2015 at 10:46 PM, Ravisankar Mani >> wrote: >> >>> Hi Everyone, >>> >>> >>> In MSSQL server suppprt "ISDATE()" function is used to fine current >>> column values date or not?. Is any possible to achieve current column >>> values date or not? >>> >>> >>> Regards, >>> Ravi >>> >> >> >
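In PySpark such a UDF is only a few lines; the accepted format below is just an example (extend the try/except to whatever formats should count as a "date"), and sqlContext is assumed to be in scope:

```python
from datetime import datetime
from pyspark.sql.types import BooleanType

def is_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

# register for use from SQL, e.g. SELECT isdate(order_date) FROM orders
sqlContext.registerFunction("isdate", is_date, BooleanType())
```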
Re: kerberos question
You could probably instead of specifying --principal [principle] --keytab [keytab] --proxy-user [proxied_user] ... arguments just create/renew a kerberos ticket before submitting a job $ kinit prinipal.name -kt keytab.file $ spark-submit ... Do you need impersonation / proxy user at all? I thought it's primary use is for Hue and similar services which uses impersonation quite heavily in kerberized cluster. -- Ruslan Dautkhanov On Wed, Nov 4, 2015 at 1:40 PM, Ted Yu wrote: > 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] > hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - > Could not find uri with key [dfs. encryption.key.provider.uri] to > create a keyProvider !! > > Could it be related to HDFS-7931 ? > > On Wed, Nov 4, 2015 at 12:30 PM, Chen Song wrote: > >> After a bit more investigation, I found that it could be related to >> impersonation on kerberized cluster. >> >> Our job is started with the following command. >> >> /usr/lib/spark/bin/spark-submit --master yarn-client --principal [principle] >> --keytab [keytab] --proxy-user [proxied_user] ... >> >> >> In application master's log, >> >> At start up, >> >> 2015-11-03 16:03:41,602 INFO [main] yarn.AMDelegationTokenRenewer >> (Logging.scala:logInfo(59)) - Scheduling login from keytab in 64789744 >> millis. >> >> Later on, when the delegation token renewer thread kicks in, it tries to >> re-login with the specified principle with new credentials and tries to >> write the new credentials into the over to the directory where the current >> user's credentials are stored. However, with impersonation, because the >> current user is a different user from the principle user, it fails with >> permission error. >> >> 2015-11-04 10:03:31,366 INFO [Delegation Token Refresh Thread-0] >> yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Attempting to >> login to KDC using principal: principal/host@domain >> 2015-11-04 10:03:31,665 INFO [Delegation Token Refresh Thread-0] >> yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Successfully >> logged into KDC. >> 2015-11-04 10:03:31,702 INFO [Delegation Token Refresh Thread-0] >> yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token for >> namenode: >> hdfs://hadoop_abc/user/proxied_user/.sparkStaging/application_1443481003186_0 >> 2015-11-04 10:03:31,904 INFO [Delegation Token Refresh Thread-0] >> hdfs.DFSClient (DFSClient.java:getDelegationToken(1025)) - Created >> HDFS_DELEGATION_TOKEN token 389283 for principal on ha-hdfs:hadoop_abc >> 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] >> hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - >> Could not find uri with key [dfs.encryption.key.provider.uri] to create a >> keyProvider !! 
>> 2015-11-04 10:03:31,944 WARN [Delegation Token Refresh Thread-0] >> security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - >> PriviledgedActionException as:proxy-user (auth:SIMPLE) >> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> 2015-11-04 10:03:31,945 WARN [Delegation Token Refresh Thread-0] ipc.Client >> (Client.java:run(675)) - Exception encountered while connecting to the >> server : >> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> 2015-11-04 10:03:31,945 WARN [Delegation Token Refresh Thread-0] >> security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - >> PriviledgedActionException as:proxy-user (auth:SIMPLE) >> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> 2015-11-04 10:03:31,963 WARN [Delegation Token Refresh Thread-0] >> yarn.YarnSparkHadoopUtil (Logging.scala:logWarning(92)) - Error while >> attempting to list files from application staging dir >> org.apache.hadoop.security.AccessControlException: Permission denied: >> user=principal, access=READ_EXECUTE, >> inode="/user/proxy-user/.sparkStaging/application_1443481003186_0":proxy-user:proxy-user:drwx-- >> >> >> Can someone confirm my understanding is right? The class relevant is >> below, >> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala >> >> Chen >> >>
Re: Pivot Data in Spark and Scala
https://issues.apache.org/jira/browse/SPARK-8992 Should be in 1.6? -- Ruslan Dautkhanov On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss wrote: > Hi, > > I have data as follows: > > A, 2015, 4 > A, 2014, 12 > A, 2013, 1 > B, 2015, 24 > B, 2013 4 > > > I need to convert the data to a new format: > A ,4,12,1 > B, 24,,4 > > Any idea how to make it in Spark Scala? > > Thanks > >
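For reference, the DataFrame API added by SPARK-8992 in 1.6 covers exactly this case. A PySpark sketch against the sample data (the column names name, year, value are assumptions):

```python
df = sqlContext.createDataFrame(
    [("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
     ("B", 2015, 24), ("B", 2013, 4)],
    ["name", "year", "value"])

# one row per name, one column per distinct year
pivoted = df.groupBy("name").pivot("year").sum("value")
pivoted.show()
# expected: (A, 1, 12, 4) and (B, 4, null, 24), with columns 2013, 2014, 2015
```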
Re: save DF to JDBC
Thank you Richard and Matthew. DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned earlier, we're on CDH 5.4 / Spark 1.3. No options for this version? Best regards, Ruslan Dautkhanov On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas wrote: > Hi Ruslan, > > Here is some sample code which writes a DataFrame to a table in a Derby > database: > > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > > val binaryVal = Array[Byte] ( 1, 2, 3, 4 ) > val timestampVal = java.sql.Timestamp.valueOf("1996-01-01 03:30:36") > val dateVal = java.sql.Date.valueOf("1996-01-01") > > val allTypes = sc.parallelize( > Array( > (1, > 1.toLong, > 1.toDouble, > 1.toFloat, > 1.toShort, > 1.toByte, > "true".toBoolean, > "one ring to rule them all", > binaryVal, > timestampVal, > dateVal, > BigDecimal.valueOf(42549.12) > ) > )).toDF( > "int_col", > "long_col", > "double_col", > "float_col", > "short_col", > "byte_col", > "boolean_col", > "string_col", > "binary_col", > "timestamp_col", > "date_col", > "decimal_col" > ) > > val properties = new java.util.Properties() > > allTypes.write.jdbc("jdbc:derby:/Users/rhillegas/derby/databases/derby1", > "all_spark_types", properties) > > Hope this helps, > > Rick Hillegas > STSM, IBM Analytics, Platform - IBM USA > > > Ruslan Dautkhanov wrote on 10/05/2015 02:44:20 PM: > > > From: Ruslan Dautkhanov > > To: user > > Date: 10/05/2015 02:45 PM > > Subject: save DF to JDBC > > > > http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc- > > to-other-databases > > > > Spark JDBC can read data from JDBC, but can it save back to JDBC? > > Like to an Oracle database through its jdbc driver. > > > > Also looked at SQL Context documentation > > https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/ > > SQLContext.html > > and can't find anything relevant. > > > > Thanks! > > > > > > -- > > Ruslan Dautkhanov >
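For what it's worth, before DataFrameWriter the 1.3 DataFrame carried JDBC write methods itself (createJDBCTable / insertIntoJDBC, deprecated in 1.4 in favor of write.jdbc); whether they cover this use case is worth verifying against the 1.3 docs. A rough sketch with a placeholder Oracle URL (the Oracle JDBC driver jar has to be on the driver and executor classpath):

```python
url = "jdbc:oracle:thin:user/password@//dbhost:1521/orcl"  # placeholder

# create the target table and load it
df.createJDBCTable(url, "target_table", False)   # allowExisting=False

# or append into an already existing table
df.insertIntoJDBC(url, "target_table", False)    # overwrite=False
```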
save DF to JDBC
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases Spark JDBC can read data from JDBC, but can it save back to JDBC? Like to an Oracle database through its jdbc driver. Also looked at SQL Context documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SQLContext.html and can't find anything relevant. Thanks! -- Ruslan Dautkhanov
Re: Spark Web UI + NGINX
It should be really simple to set up. Check this Hue + NGINX setup page: http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ In that config file, change:

1) > server_name NGINX_HOSTNAME; to "Machine A, with a public IP"

2) > server HUE_HOST1: max_fails=3; > server HUE_HOST2: max_fails=3; to "Machine B, where Spark is installed"

3) You may want to adjust "location /static/" so that it fits your Spark Web UI.

4) With a few more config lines you can set up SSL offloading too.

-- Ruslan Dautkhanov
Re: Spark data type guesser UDAF
Does it deserve to be a JIRA in Spark / Spark MLlib? How do you guys normally determine data types? Frameworks like H2O automatically determine data types by scanning a sample of the data, or the whole dataset. One can then decide, e.g., whether a variable should be categorical or numerical. Another use case is when you get an arbitrary data set (we get them quite often) and want to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably more query-time performant, e.g. better Parquet bloom filters than just storing everything as string/varchar). -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov wrote: > Wanted to take something like this > https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java > and create a Hive UDAF to create an aggregate function that returns a data > type guess. > Am I inventing a wheel? > Does Spark have something like this already built-in? > Would be very useful for new wide datasets to explore data. Would be > helpful for ML too, > e.g. to decide categorical vs numerical variables. > > > Ruslan > >
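As a starting point, even a crude sampling pass gets most of the way there for flat, all-string data; the regexes, type names, and sampling fraction below are all illustrative:

```python
import re

INT_RE = re.compile(r'^[+-]?\d+$')
FLOAT_RE = re.compile(r'^[+-]?\d*\.\d+([eE][+-]?\d+)?$')
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}$')

def guess_type(values):
    """Guess a column type from a list of string values."""
    values = [v for v in values if v not in (None, '')]
    if not values:
        return 'string'
    if all(INT_RE.match(v) for v in values):
        return 'bigint'
    if all(INT_RE.match(v) or FLOAT_RE.match(v) for v in values):
        return 'double'
    if all(DATE_RE.match(v) for v in values):
        return 'date'
    return 'string'

# sample ~1% of an all-string DataFrame and guess each column's type
sample_rows = df.sample(False, 0.01).collect()
guesses = {}
for i, c in enumerate(df.columns):
    guesses[c] = guess_type([row[i] for row in sample_rows])
```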
Spark data type guesser UDAF
I wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF for an aggregate function that returns a data type guess. Am I reinventing the wheel? Does Spark have something like this already built in? It would be very useful for exploring new, wide datasets. It would be helpful for ML too, e.g. to decide between categorical and numerical variables. Ruslan
Re: NGINX + Spark Web UI
Similar setup for Hue http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ Might give you an idea. -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 wrote: > Hello! > I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI. > I have 2 machines: > 1) Machine A, with a public IP. This machine will be used to access Spark > Web UI on the Machine B through its private IP address. > 2) Machine B, where Spark is installed (standalone master cluster, 1 worker > node and the history server) not accessible from the outside. > > Basically I want to access the Spark Web UI through my Machine A using the > URL: > http://machine_A_ip_address/spark > > Currently I have this setup: > http { > proxy_set_header X-Real-IP $remote_addr; > proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; > proxy_set_header Host $http_host; > proxy_set_header X-NginX-Proxy true; > proxy_set_header X-Ssl on; > } > > # Master cluster node > upstream app_master { > server machine_B_ip_address:8080; > } > > # Slave worker node > upstream app_worker { > server machine_B_ip_address:8081; > } > > # Job UI > upstream app_ui { > server machine_B_ip_address:4040; > } > > # History server > upstream app_history { > server machine_B_ip_address:18080; > } > > I'm really struggling in figuring out a correct location directive to make > the whole thing work, not only for accessing all ports using the url /spark > but also in making the links in the web app be transformed accordingly. Any > idea? > > > Any help really appreciated. > Thank you in advance. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/NGINX-Spark-Web-UI-tp24726.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: Spark ANN
Thank you Alexander. Sounds like quite a lot of good and exciting changes slated for Spark's ANN. Looking forward to it. -- Ruslan Dautkhanov On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander wrote: > Thank you, Feynman, this is helpful. The paper that I linked claims a big > speedup with FFTs for large convolution size. Though as you mentioned there > is no FFT transformer in Spark yet. Moreover, we would need a parallel > version of FFTs to support batch computations. So it probably worth > considering matrix-matrix multiplication for convolution optimization at > least as a first version. It can also take advantage of data batches. > > > > *From:* Feynman Liang [mailto:fli...@databricks.com] > *Sent:* Wednesday, September 09, 2015 12:56 AM > > *To:* Ulanov, Alexander > *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > My 2 cents: > > > > * There is frequency domain processing available already (e.g. spark.ml > DCT transformer) but no FFT transformer yet because complex numbers are not > currently a Spark SQL datatype > > * We shouldn't assume signals are even, so we need complex numbers to > implement the FFT > > * I have not closely studied the relative performance tradeoffs, so please > do let me know if there's a significant difference in practice > > > > > > > > On Tue, Sep 8, 2015 at 5:46 PM, Ulanov, Alexander < > alexander.ula...@hpe.com> wrote: > > That is an option too. Implementing convolutions with FFTs should be > considered as well http://arxiv.org/pdf/1312.5851.pdf. > > > > *From:* Feynman Liang [mailto:fli...@databricks.com] > *Sent:* Tuesday, September 08, 2015 12:07 PM > *To:* Ulanov, Alexander > *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > Just wondering, why do we need tensors? Is the implementation of convnets > using im2col (see here <http://cs231n.github.io/convolutional-networks/>) > insufficient? > > > > On Tue, Sep 8, 2015 at 11:55 AM, Ulanov, Alexander < > alexander.ula...@hpe.com> wrote: > > Ruslan, thanks for including me in the discussion! > > > > Dropout and other features such as Autoencoder were implemented, but not > merged yet in order to have room for improving the internal Layer API. For > example, there is an ongoing work with convolutional layer that > consumes/outputs 2D arrays. We’ll probably need to change the Layer’s > input/output type to tensors. This will influence dropout which will need > some refactoring to handle tensors too. Also, all new components should > have ML pipeline public interface. There is an umbrella issue for deep > learning in Spark https://issues.apache.org/jira/browse/SPARK-5575 which > includes various features of Autoencoder, in particular > https://issues.apache.org/jira/browse/SPARK-10408. You are very welcome > to join and contribute since there is a lot of work to be done. > > > > Best regards, Alexander > > *From:* Ruslan Dautkhanov [mailto:dautkha...@gmail.com] > *Sent:* Monday, September 07, 2015 10:09 PM > *To:* Feynman Liang > *Cc:* Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > Found a dropout commit from avulanov: > > > https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d > > > > It probably hasn't made its way to MLLib (yet?). > > > > > > -- > Ruslan Dautkhanov > > > > On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang > wrote: > > Unfortunately, not yet... 
Deep learning support (autoencoders, RBMs) is on > the roadmap for 1.6 <https://issues.apache.org/jira/browse/SPARK-10324> > though, and there is a spark package > <http://spark-packages.org/package/rakeshchalasani/MLlib-dropout> for > dropout regularized logistic regression. > > > > > > On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov > wrote: > > Thanks! > > > > It does not look Spark ANN yet supports dropout/dropconnect or any other > techniques that help avoiding overfitting? > > http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf > > https://cs.nyu.edu/~wanli/dropc/dropc.pdf > > > > ps. There is a small copy-paste typo in > > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 > > should read B&C :) > > > > > > -- > Ruslan Dautkhanov > > > > On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang > wrote: > > Backprop is used to compute the gradient here > <https://github.com/apache/spark/blob/master/mllib/src/main/scala
Re: Best way to import data from Oracle to Spark?
Sathish, Thanks for pointing to that. https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm That must be only part of Oracle's BDA codebase, not open-source Hive, right? -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > I guess data pump export from Oracle could be fast option. Hive now has > oracle data pump serde.. > > https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm > > > > On Wed, Sep 9, 2015 at 4:41 AM Reynold Xin wrote: > >> Using the JDBC data source is probably the best way. >> http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases >> >> On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: >> >>> What's the best way to import data from Oracle to Spark? Thanks! >>> >>> >>> -- >>> Best regards! >>> >>> Lin,Cui >>> >> >>
Re: Avoiding SQL Injection in Spark SQL
Using the DataFrame API is a good workaround. Another way would be to use bind variables. I don't think Spark SQL supports them. That's what Dinesh probably meant by "was not able to find any API for preparing the SQL statement safely avoiding injection". E.g. (hypothetical syntax) val sql_handler = sqlContext.sql("SELECT name FROM people WHERE age >= :var1 AND age <= :var2").parse() toddlers = sql_handler.execute("var1"->1, "var2"->3) teenagers = sql_handler.execute(13, 19) SQL injection would not be possible if Spark SQL supported bind variables, as parameters would always be treated as values and never as part of the SQL text. It's also arguably easier for developers as you don't have to escape/quote. ps. Another advantage is that Spark could parse and create the plan once, but execute it multiple times. http://www.akadia.com/services/ora_bind_variables.html This point is more relevant for OLTP-like queries, which Spark is probably not yet good at (e.g. returning a few rows quickly, within a few ms). -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 12:07 PM, Michael Armbrust wrote: > Either that or use the DataFrame API, which directly constructs query > plans and thus doesn't suffer from injection attacks (and runs on the same > execution engine). > > On Thu, Sep 10, 2015 at 12:10 AM, Sean Owen wrote: > >> I don't think this is Spark-specific. Mostly you need to escape / >> quote user-supplied values as with any SQL engine. >> >> On Thu, Sep 10, 2015 at 7:32 AM, V Dineshkumar >> wrote: >> > Hi, >> > >> > What is the preferred way of avoiding SQL Injection while using Spark >> SQL? >> > In our use case we have to take the parameters directly from the users >> and >> > prepare the SQL Statement.I was not able to find any API for preparing >> the >> > SQL statement safely avoiding injection. >> > >> > Thanks, >> > Dinesh >> > Philips India >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
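For completeness, a minimal PySpark sketch of the DataFrame-API workaround Michael mentions: the user-supplied values are passed in as typed arguments and never concatenated into a SQL string, so there is nothing to inject. The people table and the user_min/user_max variables are assumptions for the example.

from pyspark.sql import functions as F

user_min, user_max = 1, 3   # values coming from the user
toddlers = (people
    .filter((F.col('age') >= user_min) & (F.col('age') <= user_max))
    .select('name'))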
Re: Best way to import data from Oracle to Spark?
You can also sqoop oracle data in $ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/ -- Ruslan Dautkhanov On Tue, Sep 8, 2015 at 11:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best regards! > > Lin,Cui >
Re: Spark ANN
Found a dropout commit from avulanov: https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d It probably hasn't made its way to MLLib (yet?). -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang wrote: > Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on > the roadmap for 1.6 <https://issues.apache.org/jira/browse/SPARK-10324> > though, and there is a spark package > <http://spark-packages.org/package/rakeshchalasani/MLlib-dropout> for > dropout regularized logistic regression. > > > On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov > wrote: > >> Thanks! >> >> It does not look Spark ANN yet supports dropout/dropconnect or any other >> techniques that help avoiding overfitting? >> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf >> https://cs.nyu.edu/~wanli/dropc/dropc.pdf >> >> ps. There is a small copy-paste typo in >> >> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 >> should read B&C :) >> >> >> >> -- >> Ruslan Dautkhanov >> >> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang >> wrote: >> >>> Backprop is used to compute the gradient here >>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L579-L584>, >>> which is then optimized by SGD or LBFGS here >>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L878> >>> >>> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath < >>> nick.pentre...@gmail.com> wrote: >>> >>>> Haven't checked the actual code but that doc says "MLPC employes >>>> backpropagation for learning the model. .."? >>>> >>>> >>>> >>>> — >>>> Sent from Mailbox <https://www.dropbox.com/mailbox> >>>> >>>> >>>> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov >>> > wrote: >>>> >>>>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html >>>>> >>>>> Implementation seems missing backpropagation? >>>>> Was there is a good reason to omit BP? >>>>> What are the drawbacks of a pure feedforward-only ANN? >>>>> >>>>> Thanks! >>>>> >>>>> >>>>> -- >>>>> Ruslan Dautkhanov >>>>> >>>> >>>> >>> >> >
Re: Spark ANN
Thanks! It does not look Spark ANN yet supports dropout/dropconnect or any other techniques that help avoiding overfitting? http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf https://cs.nyu.edu/~wanli/dropc/dropc.pdf ps. There is a small copy-paste typo in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 should read B&C :) -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang wrote: > Backprop is used to compute the gradient here > <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L579-L584>, > which is then optimized by SGD or LBFGS here > <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L878> > > On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath > wrote: > >> Haven't checked the actual code but that doc says "MLPC employes >> backpropagation for learning the model. .."? >> >> >> >> — >> Sent from Mailbox <https://www.dropbox.com/mailbox> >> >> >> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov >> wrote: >> >>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html >>> >>> Implementation seems missing backpropagation? >>> Was there is a good reason to omit BP? >>> What are the drawbacks of a pure feedforward-only ANN? >>> >>> Thanks! >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >> >> >
Re: Parquet Array Support Broken?
Read response from Cheng Lian on Aug/27th - it looks the same problem. Workarounds 1. write that parquet file in Spark; 2. upgrade to Spark 1.5. -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote: > No, it was created in Hive by CTAS, but any help is appreciated... > > On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov > wrote: > >> That parquet table wasn't created in Spark, is it? >> >> There was a recent discussion on this list that complex data types in >> Spark prior to 1.5 often incompatible with Hive for example, if I remember >> correctly. >> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: >> >>> I am trying to read an (array typed) parquet file in spark-shell (Spark >>> 1.4.1 with Hadoop 2.6): >>> >>> {code} >>> $ bin/spark-shell >>> log4j:WARN No appenders could be found for logger >>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). >>> log4j:WARN Please initialize the log4j system properly. >>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig >>> for more info. >>> Using Spark's default log4j profile: >>> org/apache/spark/log4j-defaults.properties >>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata >>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata >>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication >>> disabled; ui acls disabled; users with view permissions: Set(hivedata); >>> users with modify permissions: Set(hivedata) >>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server >>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class >>> server' on port 43731. >>> Welcome to >>> __ >>> / __/__ ___ _/ /__ >>> _\ \/ _ \/ _ `/ __/ '_/ >>>/___/ .__/\_,_/_/ /_/\_\ version 1.4.1 >>> /_/ >>> >>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java >>> 1.8.0) >>> Type in expressions to have them evaluated. >>> Type :help for more information. >>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1 >>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata >>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata >>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication >>> disabled; ui acls disabled; users with view permissions: Set(hivedata); >>> users with modify permissions: Set(hivedata) >>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started >>> 15/09/07 13:45:27 INFO Remoting: Starting remoting >>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on >>> addresses :[akka.tcp://sparkDriver@10.10.30.52:46083] >>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' >>> on port 46083. >>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker >>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster >>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at >>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0 >>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity >>> 265.1 MB >>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is >>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21 >>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server >>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file >>> server' on port 38717. 
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator >>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port >>> 4040. Attempting port 4041. >>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on >>> port 4041. >>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at >>> http://10.10.30.52:4041 >>> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host >>> localhost >>> 15/09/07 13:45:27 INFO Executor: Using REPL class URI: >>> http://10.10.30.52:43731 >>> 15/09/07 13:45:27 INFO Utils: Successfully started service >>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973. >>> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973 >>> 15/0
Re: Parquet Array Support Broken?
That parquet table wasn't created in Spark, is it? There was a recent discussion on this list that complex data types in Spark prior to 1.5 often incompatible with Hive for example, if I remember correctly. On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: > I am trying to read an (array typed) parquet file in spark-shell (Spark > 1.4.1 with Hadoop 2.6): > > {code} > $ bin/spark-shell > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for > more info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata > 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata > 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hivedata); > users with modify permissions: Set(hivedata) > 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server > 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class > server' on port 43731. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.4.1 > /_/ > > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0) > Type in expressions to have them evaluated. > Type :help for more information. > 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1 > 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata > 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata > 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hivedata); > users with modify permissions: Set(hivedata) > 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started > 15/09/07 13:45:27 INFO Remoting: Starting remoting > 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriver@10.10.30.52:46083] > 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' > on port 46083. > 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker > 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster > 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at > /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0 > 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity > 265.1 MB > 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21 > 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server > 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file > server' on port 38717. > 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator > 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port > 4040. Attempting port 4041. > 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on > port 4041. 
> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041 > 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host > localhost > 15/09/07 13:45:27 INFO Executor: Using REPL class URI: > http://10.10.30.52:43731 > 15/09/07 13:45:27 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973. > 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973 > 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager > 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block > manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver, > localhost, 60973) > 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager > 15/09/07 13:45:28 INFO SparkILoop: Created spark context.. > Spark context available as sc. > 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version > 0.13.1 > 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called > 15/09/07 13:45:29 INFO Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2 > unknown - will be ignored > 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes > with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,
Spark ANN
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html The implementation seems to be missing backpropagation? Was there a good reason to omit BP? What are the drawbacks of a pure feedforward-only ANN? Thanks! -- Ruslan Dautkhanov
Re: Ranger-like Security on Spark
You could define access in Sentry and enable permissions sync with HDFS, so you could just grant access in Hive on a per-database or per-table basis. It should work for Spark too, as Sentry will propagate the grants to HDFS ACLs. http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/sg_hdfs_sentry_sync.html -- Ruslan Dautkhanov On Thu, Sep 3, 2015 at 1:46 PM, Daniel Schulz wrote: > Hi Matei, > > Thanks for your answer. > > My question is regarding simple authenticated Spark-on-YARN only, without > Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my > HDFS user and only be able to access files I am entitled to read/write? > Will it enforce HDFS ACLs and Ranger policies as well? > > Best regards, Daniel. > > > On 03 Sep 2015, at 21:16, Matei Zaharia wrote: > > > > If you run on YARN, you can use Kerberos, be authenticated as the right > user, etc in the same way as MapReduce jobs. > > > > Matei > > > >> On Sep 3, 2015, at 1:37 PM, Daniel Schulz > wrote: > >> > >> Hi, > >> > >> I really enjoy using Spark. An obstacle to sell it to our clients > currently is the missing Kerberos-like security on a Hadoop with simple > authentication. Are there plans, a proposal, or a project to deliver a > Ranger plugin or something similar to Spark. The target is to differentiate > users and their privileges when reading and writing data to HDFS? Is > Kerberos my only option then? > >> > >> Kind regards, Daniel. > >> - > >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > >> For additional commands, e-mail: user-h...@spark.apache.org > > > > > > - > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: FAILED_TO_UNCOMPRESS error from Snappy
https://issues.apache.org/jira/browse/SPARK-7660 ? -- Ruslan Dautkhanov On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio wrote: > Right after upgraded to 1.4.1, we started seeing this exception and yes we > picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there > anything I could try ? I don't have a repro case. > > org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:127) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:169) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:168) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:168) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > at > 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92) > at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1178) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300) > at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Success.map(Try.scala:206) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) > ... 33 more > > > -- > Kohki Nishio >
Re: Spark Master HA on YARN
There is no Spark master in YARN mode. It's standalone mode terminology. In YARN cluster mode, Spark's Application Master (Spark Driver runs in it) will be restarted automatically by RM up to yarn.resourcemanager.am.max-retries times (default is 2). -- Ruslan Dautkhanov On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta wrote: > Hi, > > Is Spark master high availability supported on YARN (yarn-client mode) > analogous to > https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability > ? > > Thanks > Bhaskie >
Re: Spark job workflow engine recommendations
We use Talend, but not for Spark workflows. Although it does have Spark componenets. https://www.talend.com/download/talend-open-studio It is free (commercial support available), easy to design and deploy workflows. Talend for BigData 6.0 was released as month ago. Is anybody using Talend for Spark? -- Ruslan Dautkhanov On Tue, Aug 11, 2015 at 11:30 AM, Hien Luu wrote: > We are in the middle of figuring that out. At the high level, we want to > combine the best parts of existing workflow solutions. > > On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone wrote: > >> Hien, >> Is Azkaban being phased out at linkedin as rumored? If so, what's >> linkedin going to use for workflow scheduling? Is there something else >> that's going to replace Azkaban? >> >> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu wrote: >> >>> In my opinion, choosing some particular project among its peers should >>> leave enough room for future growth (which may come faster than you >>> initially think). >>> >>> Cheers >>> >>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu wrote: >>> >>>> Scalability is a known issue due the the current architecture. However >>>> this will be applicable if you run more 20K jobs per day. >>>> >>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu wrote: >>>> >>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban is >>>>> being phased out at LinkedIn because of scalability issues (though >>>>> UI-wise, >>>>> Azkaban seems better). >>>>> >>>>> Vikram: >>>>> I suggest you do more research in related projects (maybe using their >>>>> mailing lists). >>>>> >>>>> Disclaimer: I don't work for LinkedIn. >>>>> >>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath < >>>>> nick.pentre...@gmail.com> wrote: >>>>> >>>>>> Hi Vikram, >>>>>> >>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just >>>>>> use local mode deployment and it is fairly easy to set up. It is pretty >>>>>> easy to use and has a nice scheduling and logging interface, as well as >>>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or >>>>>> whatever). >>>>>> >>>>>> However Spark support is not present directly - we run everything >>>>>> with shell scripts and spark-submit. There is a plugin interface where >>>>>> one >>>>>> could create a Spark plugin, but I found it very cumbersome when I did >>>>>> investigate and didn't have the time to work through it to develop that. >>>>>> >>>>>> It has some quirks and while there is actually a REST API for adding >>>>>> jos and dynamically scheduling jobs, it is not documented anywhere so you >>>>>> kinda have to figure it out for yourself. But in terms of ease of use I >>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed >>>>>> quite involved to set up. Haven't tried Luigi either. >>>>>> >>>>>> Spark job server is good but as you say lacks some stuff like >>>>>> scheduling and DAG type workflows (independent of spark-defined job >>>>>> flows). >>>>>> >>>>>> >>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke >>>>>> wrote: >>>>>> >>>>>>> Check also falcon in combination with oozie >>>>>>> >>>>>>> Le ven. 7 août 2015 à 17:51, Hien Luu a >>>>>>> écrit : >>>>>>> >>>>>>>> Looks like Oozie can satisfy most of your requirements. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> I'm looking for open source workflow tools/engines that allow us >>>>>>>>> to schedule spark jobs on a datastax cassandra cluster. 
Since there >>>>>>>>> are >>>>>>>>> tonnes of alternatives out there like Ozzie, Azkaban, Luigi , Chronos >>>>>>>>> etc, >>>>>>>>> I wanted to check with people here to see what they are using today. >>>>>>>>> >>>>>>>>> Some of the requirements of the workflow engine that I'm looking >>>>>>>>> for are >>>>>>>>> >>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not >>>>>>>>> some wrapper Java code to submit tasks. >>>>>>>>> 2. Active open source community support and well tested at >>>>>>>>> production scale. >>>>>>>>> 3. Should be dead easy to write job dependencices using XML or web >>>>>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after >>>>>>>>> B and >>>>>>>>> C are finished. Don't need to write full blown java applications to >>>>>>>>> specify >>>>>>>>> job parameters and dependencies. Should be very simple to use. >>>>>>>>> 4. Time based recurrent scheduling. Run the spark jobs at a given >>>>>>>>> time every hour or day or week or month. >>>>>>>>> 5. Job monitoring, alerting on failures and email notifications on >>>>>>>>> daily basis. >>>>>>>>> >>>>>>>>> I have looked at Ooyala's spark job server which seems to be hated >>>>>>>>> towards making spark jobs run faster by sharing contexts between the >>>>>>>>> jobs >>>>>>>>> but isn't a full blown workflow engine per se. A combination of spark >>>>>>>>> job >>>>>>>>> server and workflow engine would be ideal >>>>>>>>> >>>>>>>>> Thanks for the inputs >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> >
Re: collect() works, take() returns "ImportError: No module named iter"
There is was a similar problem reported before on this list. Weird python errors like this generally mean you have different versions of python in the nodes of your cluster. Can you check that? >From error stack you use 2.7.10 |Anaconda 2.3.0 while OS/CDH version of Python is probably 2.6. -- Ruslan Dautkhanov On Mon, Aug 10, 2015 at 3:53 PM, YaoPau wrote: > I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via > iPython Notebook. I'm getting collect() to work just fine, but take() > errors. (I'm having issues with collect() on other datasets ... but take() > seems to break every time I run it.) > > My code is below. Any thoughts? > > >> sc > > >> sys.version > '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC > 4.4.7 20120313 (Red Hat 4.4.7-1)]' > >> hourly = sc.textFile('tester') > >> hourly.collect() > [u'a man', > u'a plan', > u'a canal', > u'panama'] > >> hourly = sc.textFile('tester') > >> hourly.take(2) > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 hourly = sc.textFile('tester') > > 2 hourly.take(2) > > /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in take(self, > num) >1223 >1224 p = range(partsScanned, min(partsScanned + > numPartsToTry, totalParts)) > -> 1225 res = self.context.runJob(self, takeUpToNumLeft, p, > True) >1226 >1227 items += res > > /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in > runJob(self, rdd, partitionFunc, partitions, allowLocal) > 841 # SparkContext#runJob. > 842 mappedRDD = rdd.mapPartitions(partitionFunc) > --> 843 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), > mappedRDD._jrdd, javaPartitions, allowLocal) > 844 return list(mappedRDD._collect_iterator_through_file(it)) > 845 > > > /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, > self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > > > /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 10.0 (TID 47, dhd490101.autotrader.com): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > > "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py", > line 101, in main > process() > File > > "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py", > line 96, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > > "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/serializers.py", > line 236, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py", line > 1220, in takeUpToNumLeft > while taken < left: > ImportError: No module named iter > > at > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135) > at > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:176) > at > org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(
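Following up on the version-mismatch suggestion above, a quick way to check which Python the driver and the executors actually run (a sketch that assumes a live SparkContext sc; the element and partition counts are arbitrary):

import sys
print(sys.version)                      # driver's Python
executor_versions = (sc.parallelize(range(100), 10)
    .map(lambda _: sys.version)
    .distinct()
    .collect())
print(executor_versions)                # should match the driver's version

More than one entry, or a 2.6 entry when the driver shows Anaconda 2.7, points to a mismatch; setting PYSPARK_PYTHON to the same interpreter on all nodes usually fixes it.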
Re: TCP/IP speedup
If your network is bandwidth-bound, you'll see setting jumbo frames (MTU 9000) may increase bandwidth up to ~20%. http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm "Enabling Jumbo Frames across the cluster improves bandwidth" If Spark workload is not network bandwidth-bound, I can see it'll be a few percent to no improvement. -- Ruslan Dautkhanov On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus wrote: > H > > 2% huh. > > > -- ttfn > Simon Edelhaus > California 2015 > > On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra > wrote: > >> https://spark-summit.org/2015/events/making-sense-of-spark-performance/ >> >> On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: >> >>> Hi All! >>> >>> How important would be a significant performance improvement to TCP/IP >>> itself, in terms of >>> overall job performance improvement. Which part would be most >>> significantly accelerated? >>> Would it be HDFS? >>> >>> -- ttfn >>> Simon Edelhaus >>> California 2015 >>> >> >> >
Re: Spark Number of Partitions Recommendations
You should also take into account the amount of memory that you plan to use. It's advised not to give too much memory to each executor, otherwise GC overhead will go up. Btw, why prime numbers? -- Ruslan Dautkhanov On Wed, Jul 29, 2015 at 3:31 AM, ponkin wrote: > Hi Rahul, > > Where did you see such a recommendation? > I personally define partitions with the following formula > > partitions = nextPrimeNumberAbove( K*(--num-executors * --executor-cores ) > ) > > where > nextPrimeNumberAbove(x) - prime number which is greater than x > K - multiplicator to calculate start with 1 and encrease untill join > perfomance start to degrade > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
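Purely for illustration, the heuristic quoted above rendered as Python (the prime-number choice is the poster's own convention, not something Spark requires; the executor figures below are made up):

def next_prime_above(x):
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    n = x + 1
    while not is_prime(n):
        n += 1
    return n

num_executors, executor_cores, k = 10, 4, 2
partitions = next_prime_above(k * num_executors * executor_cores)   # -> 83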
Re: Is SPARK is the right choice for traditional OLAP query processing?
>> We want these use actions respond within 2 to 5 seconds. I think this goal is a stretch for Spark. Some queries may run faster than that on a large dataset, but in general you can't put an SLA like this on it. For example, if you have to join some huge datasets, you'll likely be well over that. Spark is great for huge jobs and it'll be much faster than MR, but I don't think Spark was designed with interactive queries in mind. For example, although Spark is "in-memory", its in-memory caching is only for the duration of a job. It's not like traditional RDBMS systems where you have a persistent "buffer cache" or "in-memory columnar storage" (both are Oracle terms). If you have multiple users running interactive BI queries, results that were cached for the first user wouldn't be used by the second user, unless you invent something that keeps a persistent Spark context, serves users' requests, and decides which RDDs to cache, when and how. At least that's my understanding of how Spark works. If I'm wrong, I will be glad to hear that, as we ran into the same questions. As we use Cloudera's CDH, I'm not sure where Hortonworks are with their Tez project, but Tez has components that more closely resemble the "buffer cache" or "in-memory columnar storage" caching of traditional RDBMS systems, and it may get better and/or more predictable performance on BI queries. -- Ruslan Dautkhanov On Mon, Jul 20, 2015 at 6:04 PM, renga.kannan wrote: > All, > I really appreciate anyone's input on this. We are having a very simple > traditional OLAP query processing use case. Our use case is as follows. > > > 1. We have a customer sales order table data coming from RDBMs table. > 2. There are many dimension columns in the sales order table. For each of > those dimensions, we have individual dimension tables that stores the > dimension record sets. > 3. We also have some BI like hierarchies that is defined for dimension data > set. > > What we want for business users is as follows.? > > 1. We wanted to show some aggregated values from sales Order transaction > table columns. > 2. User would like to filter these with specific dimension values from > dimension table. > 3. User should be able to drill down from higher level to lower level by > traversing hierarchy on dimension > > > We want these use actions respond within 2 to 5 seconds. > > > We are thinking about using SPARK as our backend enginee to sever data to > these front end application. > > > Has anyone tried using SPARK for these kind of use cases. These are all > traditional use cases in BI space. If so, can SPARK respond to these > queries > with in 2 to 5 seconds for large data sets. > > Thanks, > Renga > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-SPARK-is-the-right-choice-for-traditional-OLAP-query-processing-tp23921.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: Is IndexedRDD available in Spark 1.4.0?
Or Spark on HBase ) http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ -- Ruslan Dautkhanov On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu wrote: > bq. that is, key-value stores > > Please consider HBase for this purpose :-) > > On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das > wrote: > >> I do not recommend using IndexRDD for state management in Spark >> Streaming. What it does not solve out-of-the-box is checkpointing of >> indexRDDs, which important because long running streaming jobs can lead to >> infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey >> operation which you can use, which gives state management capabilities. >> Though for most flexible arbitrary look up of stuff, its better to use a >> dedicated system that is designed and optimized for long term storage of >> data, that is, key-value stores, databases, etc. >> >> On Tue, Jul 14, 2015 at 5:44 PM, Ted Yu wrote: >> >>> Please take a look at SPARK-2365 which is in progress. >>> >>> On Tue, Jul 14, 2015 at 5:18 PM, swetha >>> wrote: >>> >>>> Hi, >>>> >>>> Is IndexedRDD available in Spark 1.4.0? We would like to use this in >>>> Spark >>>> Streaming to do lookups/updates/deletes in RDDs using keys by storing >>>> them >>>> as key/value pairs. >>>> >>>> Thanks, >>>> Swetha >>>> >>>> >>>> >>>> -- >>>> View this message in context: >>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD-available-in-Spark-1-4-0-tp23841.html >>>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>>> >>>> - >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>> For additional commands, e-mail: user-h...@spark.apache.org >>>> >>>> >>> >> >
Re: RECEIVED SIGNAL 15: SIGTERM
>> the executor receives a SIGTERM (from whom???) >From YARN Resource Manager. Check if yarn fair scheduler preemption and/or speculative execution are turned on, then it's quite possible and not a bug. -- Ruslan Dautkhanov On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim wrote: > Based on my experience, YARN containers can get SIGTERM when > > - it produces too much logs and use up the hard drive > - it uses off-heap memory more than what is given by > spark.yarn.executor.memoryOverhead configuration. It might be due to too > many classes loaded (less than MaxPermGen but more than memoryOverhead), or > some other off-heap memory allocated by networking library, etc. > - it opens too many file descriptors, which you can check on the executor > node's /proc//fd/ > > Does any of these apply to your situation? > > Jong Wook > > On Jul 7, 2015, at 19:16, Kostas Kougios > wrote: > > I am still receiving these weird sigterms on the executors. The driver > claims > it lost the executor, the executor receives a SIGTERM (from whom???) > > It doesn't seem a memory related issue though increasing memory takes the > job a bit further or completes it. But why? there is no memory pressure on > neither driver nor executor. And nothing in the logs indicating so. > > driver: > > 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in > stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes) > 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in > stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240) > 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated > or disconnected! Shutting down. cruncher05.stratified:32976 > 15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on > cruncher05.stratified: remote Rpc client disassociated > 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 > from TaskSet 0.0 > 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated > or disconnected! Shutting down. cruncher05.stratified:32976 > 15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with > remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
> > 15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage > 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1 > lost) > > gc log for driver, it doesnt look like it run outofmem: > > 2015-07-07T10:45:19.887+0100: [GC (Allocation Failure) > 1764131K->1391211K(3393024K), 0.0102839 secs] > 2015-07-07T10:46:00.934+0100: [GC (Allocation Failure) > 1764971K->1391867K(3405312K), 0.0099062 secs] > 2015-07-07T10:46:45.252+0100: [GC (Allocation Failure) > 1782011K->1392596K(3401216K), 0.0167572 secs] > > executor: > > 15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0 > (TID 14750) > 15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not > found, computing it > 15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL 15: SIGTERM > 15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called > > executor gc log (no outofmem as it seems): > 2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC) > 24696750K->23712939K(33523712K), 0.0416640 secs] > 2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC) > 24700520K->23722043K(33523712K), 0.0391156 secs] > 2015-07-07T10:47:02.862+0100: [GC (Allocation Failure) > 24709182K->23726510K(33518592K), 0.0390784 secs] > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > >
Re: Spark equivalent for Oracle's analytical functions
Should be part of Spark 1.4 https://issues.apache.org/jira/browse/SPARK-1442 I don't see it in the documentation though https://spark.apache.org/docs/latest/sql-programming-guide.html -- Ruslan Dautkhanov On Mon, Jul 6, 2015 at 5:06 AM, gireeshp wrote: > Is there any equivalent of Oracle's *analytical functions* in Spark SQL. > > For example, if I have following data set (say table T): > /EID|DEPT > 101|COMP > 102|COMP > 103|COMP > 104|MARK/ > > In Oracle, I can do something like > /select EID, DEPT, count(1) over (partition by DEPT) CNT from T;/ > > to get: > /EID|DEPT|CNT > 101|COMP|3 > 102|COMP|3 > 103|COMP|3 > 104|MARK|1/ > > Can we do an equivalent query in Spark-SQL? Or what is the best method to > get such results in Spark dataframes? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-equivalent-for-Oracle-s-analytical-functions-tp23646.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
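For reference, a short PySpark sketch of the window-function API that landed with SPARK-1442; in 1.4 this needs a HiveContext, if I remember correctly. The tiny dataset mirrors the example from the question; everything else (a live SparkContext sc, the temp table name T) is assumed.

from pyspark.sql import HiveContext, Window
from pyspark.sql import functions as F

sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame(
    [(101, 'COMP'), (102, 'COMP'), (103, 'COMP'), (104, 'MARK')],
    ['EID', 'DEPT'])

w = Window.partitionBy('DEPT')
df.select('EID', 'DEPT', F.count('EID').over(w).alias('CNT')).show()

# or the same thing through SQL
df.registerTempTable('T')
sqlContext.sql('SELECT EID, DEPT, count(1) OVER (PARTITION BY DEPT) AS CNT FROM T').show()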
Re: Does spark supports the Hive function posexplode function?
You can see what Spark SQL functions are supported in Spark by doing the following in a notebook: %sql show functions https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html I think Spark SQL support is currently around Hive ~0.11? -- Ruslan Dautkhanov On Tue, Jul 7, 2015 at 3:10 PM, Jeff J Li wrote: > I am trying to use the posexplode function in the HiveContext to > auto-generate a sequence number. This feature is supposed to be available > Hive 0.13.0. > > SELECT name, phone FROM contact LATERAL VIEW > posexplode(phoneList.phoneNumber) phoneTable AS pos, phone > > My test program failed with the following > > java.lang.ClassNotFoundException: posexplode > at java.net.URLClassLoader.findClass(URLClassLoader.java:665) > at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:942) > at java.lang.ClassLoader.loadClass(ClassLoader.java:851) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:827) > at > org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:147) > at > org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274) > at > org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274) > > Does spark support this Hive function posexplode? If not, how to patch it > to support this? I am on Spark 1.3.1 > > Thanks, > Jeff Li > > >
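If posexplode turns out not to be available in the Spark version at hand, one hedged workaround is to generate the position index on the RDD side with enumerate(). The contact DataFrame and the phoneList.phoneNumber array column are taken from the question; the rest (sqlContext, the output column names) is assumed for the sketch.

rows = contact.select('name', 'phoneList.phoneNumber').rdd
numbered = rows.flatMap(
    lambda r: [(r[0], pos, phone) for pos, phone in enumerate(r[1] or [])])
numbered_df = sqlContext.createDataFrame(numbered, ['name', 'pos', 'phone'])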
Re: How to upgrade Spark version in CDH 5.4
Good question. I'd like to know the same. Although I think you'll loose supportability. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 2:03 AM, Ashish Dutt wrote: > > Hi, > I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4. > I checked the documentation here > <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_upgrade_cdh5_using_parcels.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7dc6__section_zd5_1yz_l4> > but > I do not see any thing relevant > > Any suggestions directing to a solution are welcome. > > Thanks, > Ashish >
Re: Real-time data visualization with Zeppelin
Don't think it is a Zeppelin problem.. RDDs are "immutable". Unless you integrate something like IndexedRDD http://spark-packages.org/package/amplab/spark-indexedrdd into Zeppelin I think it's not possible. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 3:24 PM, Brandon White wrote: > Can you use a con job to update it every X minutes? > > On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya > wrote: > >> Hi all – I’m just wondering if anyone has had success integrating Spark >> Streaming with Zeppelin and actually dynamically updating the data in near >> real-time. From my investigation, it seems that Zeppelin will only allow >> you to display a snapshot of data, not a continuously updating table. Has >> anyone figured out if there’s a way to loop a display command or how to >> provide a mechanism to continuously update visualizations? >> >> Thank you, >> Ilya Ganelin >> >> >> -- >> >> The information contained in this e-mail is confidential and/or >> proprietary to Capital One and/or its affiliates and may only be used >> solely in performance of work or services for Capital One. The information >> transmitted herewith is intended only for use by the individual or entity >> to which it is addressed. If the reader of this message is not the intended >> recipient, you are hereby notified that any review, retransmission, >> dissemination, distribution, copying or other use of, or taking of any >> action in reliance upon this information is strictly prohibited. If you >> have received this communication in error, please contact the sender and >> delete the material from your computer. >> > >
Re: Caching in spark
Hi Akhil, I'm curious whether RDDs are stored internally in a columnar format as well, or whether an RDD is converted to the columnar format only when it is cached in the SQL context. What about DataFrames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das wrote: > > https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory > > Thanks > Best Regards > > On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar > wrote: > >> Hi Guys, >> >> Can any one please share me how to use caching feature of spark via spark >> sql queries? >> >> -Vinod >> > >
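To make the linked "Caching Data In Memory" section concrete, the caching hooks exposed through the SQL context look like this (the table name is made up); it is the cached tables that get stored in the in-memory columnar format:

sqlContext.cacheTable('people')          # programmatic
sqlContext.sql('CACHE TABLE people')     # same thing via SQL
# ... run queries against the in-memory columnar cache ...
sqlContext.uncacheTable('people')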
Re: configuring max sum of cores and memory in cluster through command line
It's not possible to specify YARN RM parameters on the spark-submit command line. You have to specify all resources that are available on your cluster to YARN upfront. If you want to limit the amount of resources available to your Spark job, consider using YARN dynamic resource pools instead http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_resource_pools.html -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 4:20 PM, Alexander Waldin wrote: > Hi, > > I'd like to specify the total sum of cores / memory as command line > arguments with spark-submit. That is, I'd like to set > yarn.nodemanager.resource.memory-mb and the > yarn.nodemanager.resource.cpu-vcores parameters as described in this blog > <http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/> > post. > > when submitting through the command line, what is the correct way to do > it? Is it: > > --conf spark.yarn.nodemanager.resource.memory-mb=54g > --conf spark.yarn.nodemanager.resource.cpu-vcores=31 > > or > > --conf yarn.nodemanager.resource.memory-mb=54g > --conf yarn.nodemanager.resource.cpu-vcores=31 > > > or something else? I tried these, and I tried looking in the > ResourceManager UI to see if they were set, but couldn't find them. > > Thanks! > > Alexander >
Re: .NET on Apache Spark?
Scala used to run on .NET http://www.scala-lang.org/old/node/10299 -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 1:26 PM, pedro wrote: > You might try using .pipe() and installing your .NET program as a binary > across the cluster (or using addFile). Its not ideal to pipe things in/out > along with the overhead, but it would work. > > I don't know much about IronPython, but perhaps changing the default python > by changing your path might work? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/NET-on-Apache-Spark-tp23578p23594.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: Problem after enabling Hadoop native libraries
You can run hadoop checknative -a and see if bzip2 is detected correctly. -- Ruslan Dautkhanov On Fri, Jun 26, 2015 at 10:18 AM, Marcelo Vanzin wrote: > What master are you using? If this is not a "local" master, you'll need to > set LD_LIBRARY_PATH on the executors also (using > spark.executor.extraLibraryPath). > > If you are using local, then I don't know what's going on. > > On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh > wrote: > >> Hi, >> I'm having trouble reading Bzip2 compressed sequence files after I >> enabled hadoop native libraries in spark. >> >> Running >> LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/ $SPARK_HOME/bin/spark-submit >> --class gives the following error >> >> 5/06/26 00:48:02 INFO CodecPool: Got brand-new decompressor [.bz2] >> 15/06/26 00:48:02 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID >> 3) >> java.lang.UnsupportedOperationException >> at >> org.apache.hadoop.io.compress.bzip2.BZip2DummyDecompressor.decompress(BZip2DummyDecompressor.java:32) >> at >> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91) >> at >> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) >> at java.io.DataInputStream.readFully(DataInputStream.java:195) >> at java.io.DataInputStream.readLong(DataInputStream.java:416) >> >> removing the LD_LIBRARY_PATH makes spark run fine but it gives the >> following warning >> WARN NativeCodeLoader: Unable to load native-hadoop library for your >> platform... using builtin-java classes where applicable >> >> Has anyone else run into this issue ? Any help is welcome. >> > > > > -- > Marcelo >
Re: flume sinks supported by spark streaming
https://spark.apache.org/docs/latest/streaming-flume-integration.html Yep, avro sink is the correct one. -- Ruslan Dautkhanov On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid wrote: > Hi! > > > I want to integrate flume with spark streaming. I want to know which sink > type of flume are supported by spark streaming? I saw one example using > avroSink. > > Thanks > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/flume-sinks-supported-by-spark-streaming-tp23462.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
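For completeness, the Flume side of the push-based approach is an avro sink pointed at the host/port the Spark receiver listens on (the agent, channel, host and port names below are placeholders):

agent.sinks.sparkSink.type = avro
agent.sinks.sparkSink.hostname = <receiver-host>
agent.sinks.sparkSink.port = <port>
agent.sinks.sparkSink.channel = memoryChannel

and on the Spark side, assuming a release where the Flume Python API is available (the Scala FlumeUtils call is analogous):

from pyspark.streaming.flume import FlumeUtils
flumeStream = FlumeUtils.createStream(ssc, "<receiver-host>", <port>)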
Re: [ERROR] Insufficient Space
Vadim, You could edit /etc/fstab, then issue mount -o remount to increase shared memory online. I didn't know Spark uses shared memory. Hope this helps. On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy wrote: > Hello Spark Experts, > > I've been running a standalone Spark cluster on EC2 for a few months now, > and today I get this error: > > "IOError: [Errno 28] No space left on device > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > OpenJDK 64-Bit Server VM warning: Insufficient space for shared memory > file" > > I guess I need to resize the cluster. What's the best way to do that? > > Thanks, > Vadim > >
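The "Insufficient space for shared memory file" warning usually points at a full tmpfs mount (commonly /dev/shm, or the filesystem holding the JVM's perf files under /tmp). A rough example of growing a tmpfs online, assuming it is /dev/shm and the instance can spare the memory (the size is illustrative):

# /etc/fstab entry
tmpfs   /dev/shm   tmpfs   defaults,size=2g   0 0

mount -o remount /dev/shm
df -h /dev/shm    # verify the new size

The "No space left on device" IOError itself may simply be a full disk used for shuffle/temp space, so checking df -h on the workers is worth doing first.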
Re: Does MLlib have attribute importance?
Got it. Thanks! -- Ruslan Dautkhanov On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng wrote: > ChiSqSelector calls an RDD of labeled points, where the label is the > target. See > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L120 > > On Wed, Jun 17, 2015 at 10:22 PM, Ruslan Dautkhanov > wrote: > > Thank you Xiangrui. > > > > Oracle's attribute importance mining function have a target variable. > > "Attribute importance is a supervised function that ranks attributes > > according to their significance in predicting a target." > > MLlib's ChiSqSelector does not have a target variable. > > > > > > > > > > -- > > Ruslan Dautkhanov > > > > On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng wrote: > >> > >> We don't have it in MLlib. The closest would be the ChiSqSelector, > >> which works for categorical data. -Xiangrui > >> > >> On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov < > dautkha...@gmail.com> > >> wrote: > >> > What would be closest equivalent in MLLib to Oracle Data Miner's > >> > Attribute > >> > Importance mining function? > >> > > >> > > >> > > http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 > >> > > >> > Attribute importance is a supervised function that ranks attributes > >> > according to their significance in predicting a target. > >> > > >> > > >> > Best regards, > >> > Ruslan Dautkhanov > > > > >
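For anyone landing on this thread later, a minimal PySpark sketch of what Xiangrui describes (it assumes a release whose Python MLlib exposes ChiSqSelector, and the features must be categorical/discrete):

from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# the label plays the role of the "target" attribute
data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([0.0, 1.0, 2.0])),
    LabeledPoint(1.0, Vectors.dense([1.0, 1.0, 0.0])),
])
selector = ChiSqSelector(numTopFeatures=2)   # keep the 2 most predictive attributes
model = selector.fit(data)
reduced = model.transform(data.map(lambda lp: lp.features))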
Re: Does MLlib have attribute importance?
Thank you Xiangrui. Oracle's attribute importance mining function has a target variable. "Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target." MLlib's ChiSqSelector does not have a target variable. -- Ruslan Dautkhanov On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng wrote: > We don't have it in MLlib. The closest would be the ChiSqSelector, > which works for categorical data. -Xiangrui > > On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov > wrote: > > What would be closest equivalent in MLLib to Oracle Data Miner's > Attribute > > Importance mining function? > > > > > http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 > > > > Attribute importance is a supervised function that ranks attributes > > according to their significance in predicting a target. > > > > > > Best regards, > > Ruslan Dautkhanov >
Does MLlib have attribute importance?
What would be the closest equivalent in MLlib to Oracle Data Miner's Attribute Importance mining function? http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target. Best regards, Ruslan Dautkhanov
k-means for text mining in a streaming context
Hello, Would Feature Extraction and Transformation (https://spark.apache.org/docs/latest/mllib-feature-extraction.html) work in a streaming context? I want to extract text features and build k-means clusters in a streaming context to detect anomalies on a continuous text stream. Would that be possible? Best regards, Ruslan Dautkhanov
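A rough sketch of what I have in mind (it assumes a Spark release where StreamingKMeans is exposed to Python; in older releases the equivalent Scala API exists):

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import StreamingKMeans

tf = HashingTF(numFeatures=1000)
# lines is a DStream of raw text; hash each document into a fixed-size term-frequency vector
vectors = lines.map(lambda line: tf.transform(line.split(" ")))

model = StreamingKMeans(k=10, decayFactor=0.5).setRandomCenters(dim=1000, weight=1.0, seed=42)
model.trainOn(vectors)
assignments = model.predictOn(vectors)   # cluster ids; points far from their center could be flagged as anomalies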
Re: Spark Job always cause a node to reboot
vm.swappiness=0? Some vendors recommend setting this to 0 (zero), although I've seen that cause even the kernel to fail to allocate memory, and it may cause a node reboot. If that's the case, set vm.swappiness to 5-10 and decrease spark.*.memory: check whether spark.driver.memory + spark.executor.memory + OS overhead, etc. exceeds the amount of memory the node has. -- Ruslan Dautkhanov On Thu, Jun 4, 2015 at 8:59 AM, Chao Chen wrote: > Hi all, > I am new to spark. I am trying to deploy HDFS (hadoop-2.6.0) and > Spark-1.3.1 with four nodes, and each node has 8-cores and 8GB memory. > One is configured as headnode running masters, and 3 others are workers > > But when I try to run the Pagerank from HiBench, it always cause a node to > reboot during the middle of the work for all scala, java, and python > versions. But works fine > with the MapReduce version from the same benchmark. > > I also tried standalone deployment, got the same issue. > > My spark-defaults.conf > spark.masteryarn-client > spark.driver.memory 4g > spark.executor.memory 4g > spark.rdd.compress false > > > The job submit script is: > > bin/spark-submit --properties-file > HiBench/report/pagerank/spark/scala/conf/sparkbench/spark.conf --class > org.apache.spark.examples.SparkPageRank --master yarn-client > --num-executors 2 --executor-cores 4 --executor-memory 4G --driver-memory > 4G > HiBench/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar > hdfs://discfarm:9000/HiBench/Pagerank/Input/edges > hdfs://discfarm:9000/HiBench/Pagerank/Output 3 > > What is problem with my configuration ? and How can I find the cause ? > > any help is welcome ! > > > > > > > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
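To check and adjust it on a node (a quick sketch; the target value of 10 is a judgment call):

sysctl vm.swappiness              # show the current value
sudo sysctl -w vm.swappiness=10   # change it online
# persist across reboots by adding 'vm.swappiness = 10' to /etc/sysctl.conf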
Re: How to monitor Spark Streaming from Kafka?
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4 http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf -- Ruslan Dautkhanov On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg wrote: > Thank you, Tathagata, Cody, Otis. > > - Dmitry > > > On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic < > otis.gospodne...@gmail.com> wrote: > >> I think you can use SPM - http://sematext.com/spm - it will give you all >> Spark and all Kafka metrics, including offsets broken down by topic, etc. >> out of the box. I see more and more people using it to monitor various >> components in data processing pipelines, a la >> http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/ >> >> Otis >> >> On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg >> wrote: >> >>> Hi, >>> >>> What are some of the good/adopted approached to monitoring Spark >>> Streaming >>> from Kafka? I see that there are things like >>> http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they >>> all >>> assume that Receiver-based streaming is used? >>> >>> Then "Note that one disadvantage of this approach (Receiverless Approach, >>> #2) is that it does not update offsets in Zookeeper, hence >>> Zookeeper-based >>> Kafka monitoring tools will not show progress. However, you can access >>> the >>> offsets processed by this approach in each batch and update Zookeeper >>> yourself". >>> >>> The code sample, however, seems sparse. What do you need to do here? - >>> directKafkaStream.foreachRDD( >>> new Function, Void>() { >>> @Override >>> public Void call(JavaPairRDD rdd) throws >>> IOException { >>> OffsetRange[] offsetRanges = >>> ((HasOffsetRanges)rdd).offsetRanges >>> // offsetRanges.length = # of Kafka partitions being >>> consumed >>> ... >>> return null; >>> } >>> } >>> ); >>> >>> and if these are updated, will KafkaOffsetMonitor work? >>> >>> Monitoring seems to center around the notion of a consumer group. But in >>> the receiverless approach, code on the Spark consumer side doesn't seem >>> to >>> expose a consumer group parameter. Where does it go? Can I/should I >>> just >>> pass in group.id as part of the kafkaParams HashMap? >>> >>> Thanks >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> >
Re: Value for SPARK_EXECUTOR_CORES
It's not only about cores. Keep in mind spark.executor.cores also affects the available memory for each task: From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ The memory available to each task is (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores. Memory fraction and safety fraction default to 0.2 and 0.8 respectively. I'd test spark.executor.cores with 2, 4, 8 and 16 and see what makes your job run faster. -- Ruslan Dautkhanov On Wed, May 27, 2015 at 6:46 PM, Mulugeta Mammo wrote: > My executor has the following spec (lscpu): > > CPU(s): 16 > Core(s) per socket: 4 > Socket(s): 2 > Thread(s) per code: 2 > > The CPU count is obviously 4*2*2 = 16. My question is what value is Spark > expecting in SPARK_EXECUTOR_CORES ? The CPU count (16) or total # of cores > (2 * 2 = 4) ? > > Thanks >
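To make that formula concrete (illustrative numbers only): with spark.executor.memory=4g and spark.executor.cores=4, each task gets roughly 4096 MB * 0.2 * 0.8 / 4 ≈ 164 MB of shuffle memory; the same executor with 16 cores leaves only about 41 MB per task.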
Re: PySpark Logs location
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. You can enable log aggregation by changing yarn.log-aggregation-enable to true so it'll be easier to see yarn application logs. -- Ruslan Dautkhanov On Thu, May 21, 2015 at 5:08 AM, Oleg Ruchovets wrote: > Doesn't work for me so far , >using command but got such output. What should I check to fix the > issue? Any configuration parameters ... > > > [root@sdo-hdp-bd-master1 ~]# yarn logs -applicationId > application_1426424283508_0048 > 15/05/21 13:25:09 INFO impl.TimelineClientImpl: Timeline service address: > http://hdp-bd-node1.development.c4i:8188/ws/v1/timeline/ > 15/05/21 13:25:09 INFO client.RMProxy: Connecting to ResourceManager at > hdp-bd-node1.development.c4i/12.23.45.253:8050 > /app-logs/root/logs/application_1426424283508_0048does not exist. > *Log aggregation has not completed or is not enabled.* > > Thanks > Oleg. > > On Wed, May 20, 2015 at 11:33 PM, Ruslan Dautkhanov > wrote: > >> Oleg, >> >> You can see applicationId in your Spark History Server. >> Go to http://historyserver:18088/ >> >> Also check >> https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application >> >> It should be no different with PySpark. >> >> >> -- >> Ruslan Dautkhanov >> >> On Wed, May 20, 2015 at 2:12 PM, Oleg Ruchovets >> wrote: >> >>> Hi Ruslan. >>> Could you add more details please. >>> Where do I get applicationId? In case I have a lot of log files would it >>> make sense to view it from single point. >>> How actually I can configure / manage log location of PySpark? >>> >>> Thanks >>> Oleg. >>> >>> On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov < >>> dautkha...@gmail.com> wrote: >>> >>>> You could use >>>> >>>> yarn logs -applicationId application_1383601692319_0008 >>>> >>>> >>>> >>>> -- >>>> Ruslan Dautkhanov >>>> >>>> On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets >>>> wrote: >>>> >>>>> Hi , >>>>> >>>>> I am executing PySpark job on yarn ( hortonworks distribution). >>>>> >>>>> Could someone pointing me where is the log locations? >>>>> >>>>> Thanks >>>>> Oleg. >>>>> >>>> >>>> >>> >> >
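For reference, turning aggregation on is a one-property change in yarn-site.xml (the YARN services then need a restart; property name as in the Hadoop docs):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

after which yarn logs -applicationId <application id> should work from any node once an application finishes.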
Re: PySpark Logs location
Oleg, You can see applicationId in your Spark History Server. Go to http://historyserver:18088/ Also check https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application It should be no different with PySpark. -- Ruslan Dautkhanov On Wed, May 20, 2015 at 2:12 PM, Oleg Ruchovets wrote: > Hi Ruslan. > Could you add more details please. > Where do I get applicationId? In case I have a lot of log files would it > make sense to view it from single point. > How actually I can configure / manage log location of PySpark? > > Thanks > Oleg. > > On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov > wrote: > >> You could use >> >> yarn logs -applicationId application_1383601692319_0008 >> >> >> >> -- >> Ruslan Dautkhanov >> >> On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets >> wrote: >> >>> Hi , >>> >>> I am executing PySpark job on yarn ( hortonworks distribution). >>> >>> Could someone pointing me where is the log locations? >>> >>> Thanks >>> Oleg. >>> >> >> >
Re: PySpark Logs location
You could use yarn logs -applicationId application_1383601692319_0008 -- Ruslan Dautkhanov On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets wrote: > Hi , > > I am executing PySpark job on yarn ( hortonworks distribution). > > Could someone pointing me where is the log locations? > > Thanks > Oleg. >
Re: Reading Nested Fields in DataFrames
I had the same question on Stack Overflow recently: http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark Lomig Mégard gave a detailed answer on how to do this without using LATERAL VIEW. On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh wrote: > Hi , > I am trying to read Nested Avro data in Spark 1.3 using DataFrames. > I need help to retrieve the Inner element data in the Structure below. > > Below is the schema when I enter df.printSchema : > > |-- THROTTLING_PERCENTAGE: double (nullable = false) > |-- IMPRESSION_TYPE: string (nullable = false) > |-- campaignArray: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- COOKIE: string (nullable = false) > |||-- CAMPAIGN_ID: long (nullable = false) > > > How can I access CAMPAIGN_ID field in this schema ? > > Thanks, > Ashish Kr. Singh >
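A rough illustration of getting at the nested field (it assumes Spark 1.4+ for pyspark.sql.functions.explode; on 1.3 the HiveQL equivalent is SELECT c.CAMPAIGN_ID FROM tbl LATERAL VIEW explode(campaignArray) t AS c):

from pyspark.sql.functions import explode

campaigns = df.select(explode(df.campaignArray).alias("c"))   # one row per element of the array
campaigns.select("c.CAMPAIGN_ID").show()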