Re: df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Spark 1.5 is the latest that I have access to and where this problem
happens.

I don't see it fixed in master, but I might be wrong. Diff attached.

https://raw.githubusercontent.com/apache/spark/branch-1.5/python/pyspark/sql/types.py
https://raw.githubusercontent.com/apache/spark/d57daf1f7732a7ac54a91fe112deeda0a254f9ef/python/pyspark/sql/types.py



-- 
Ruslan Dautkhanov

On Wed, Mar 16, 2016 at 4:44 PM, Reynold Xin <r...@databricks.com> wrote:

> We probably should have the alias. Is this still a problem on master
> branch?
>
> On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> Running following:
>>
>> #fix schema for gaid which should not be Double
>>> from pyspark.sql.types import *
>>> customSchema = StructType()
>>> for (col,typ) in tsp_orig.dtypes:
>>>     if col=='Agility_GAID':
>>>         typ='string'
>>>     customSchema.add(col,typ,True)
>>
>>
>> Getting
>>
>>   ValueError: Could not parse datatype: bigint
>>
>>
>> Looks like pyspark.sql.types doesn't know anything about bigint..
>> Should it be aliased to LongType in pyspark.sql.types?
>>
>> Thanks
>>
>>
>> On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov <dautkha...@gmail.com
>> > wrote:
>>
>>> Hello,
>>>
>>> Looking at
>>>
>>> https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html
>>>
>>> and can't wrap my head around how to convert string data type names to
>>> actual pyspark.sql.types data types?
>>>
>>> Does pyspark.sql.types have an interface to return StringType() for
>>> "string", IntegerType() for "integer", etc.? If it doesn't exist, it
>>> would be great to have such a mapping function.
>>>
>>> Thank you.
>>>
>>>
>>> ps. I have a data frame, and use its dtypes to loop through all columns
>>> to fix a few
>>> columns' data types as a workaround for SPARK-13866.
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>
>>
>
683a684,806
> _FIXED_DECIMAL = re.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)")
>
>
> _BRACKETS = {'(': ')', '[': ']', '{': '}'}
>
>
> def _parse_basic_datatype_string(s):
>     if s in _all_atomic_types.keys():
>         return _all_atomic_types[s]()
>     elif s == "int":
>         return IntegerType()
>     elif _FIXED_DECIMAL.match(s):
>         m = _FIXED_DECIMAL.match(s)
>         return DecimalType(int(m.group(1)), int(m.group(2)))
>     else:
>         raise ValueError("Could not parse datatype: %s" % s)
>
>
> def _ignore_brackets_split(s, separator):
>     """
>     Splits the given string by given separator, but ignore separators inside
>     brackets pairs, e.g. given "a,b" and separator ",", it will return
>     ["a", "b"], but given "a<b,c>, d", it will return ["a<b,c>", "d"].
>     """
>     parts = []
>     buf = ""
>     level = 0
>     for c in s:
>         if c in _BRACKETS.keys():
>             level += 1
>             buf += c
>         elif c in _BRACKETS.values():
>             if level == 0:
>                 raise ValueError("Brackets are not correctly paired: %s" % s)
>             level -= 1
>             buf += c
>         elif c == separator and level > 0:
>             buf += c
>         elif c == separator:
>             parts.append(buf)
>             buf = ""
>         else:
>             buf += c
>
>     if len(buf) == 0:
>         raise ValueError("The %s cannot be the last char: %s" % (separator, s))
>     parts.append(buf)
>     return parts
>
>
> def _parse_struct_fields_string(s):
>     parts = _ignore_brackets_split(s, ",")
>     fields = []
>     for part in parts:
>         name_and_type = _ignore_brackets_split(part, ":")
>         if len(name_and_type) != 2:
>             raise ValueError("The strcut field string format is: 'field_name:field_type', " +
>                              "but got: %s" % part)
>         field_name = name_and_type[0].strip()
>         field_type = _parse_datatype_string(name_and_type[1])
>         fields.append(StructField(field_name, field_type))
>     return StructType(fields)
>
>
> def _parse_datatype_string(s):
>     """
>     Parse
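
For what it's worth, a small illustration of my own (not part of the diff) of
why "bigint" trips these parsers: df.dtypes reports DataType.simpleString()
names, while _all_atomic_types, which the code above consults, is keyed by
DataType.typeName() names, and the two differ for LongType.

# Illustration only (not from the diff): the simpleString()/typeName()
# mismatch behind "Could not parse datatype: bigint".
from pyspark.sql.types import LongType, _all_atomic_types

print(LongType().simpleString())        # 'bigint' <- what df.dtypes returns
print(LongType.typeName())              # 'long'   <- the key in _all_atomic_types
print("bigint" in _all_atomic_types)    # False    <- hence the ValueError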

Re: df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Running following:

#fix schema for gaid which should not be Double
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col,typ) in tsp_orig.dtypes:
>     if col=='Agility_GAID':
>         typ='string'
>     customSchema.add(col,typ,True)


Getting

  ValueError: Could not parse datatype: bigint


Looks like pyspark.sql.types doesn't know anything about bigint..
Should it be aliased to LongType in pyspark.sql.types?

Thanks


On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Hello,
>
> Looking at
>
> https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html
>
> and can't wrap my head around how to convert string data types names to
> actual
> pyspark.sql.types data types?
>
> Does pyspark.sql.types have an interface to return StringType() for
> "string",
> IntegerType() for "integer" etc? If it doesn't exist it would be great to
> have such a
> mapping function.
>
> Thank you.
>
>
> ps. I have a data frame, and use its dtypes to loop through all columns to
> fix a few
> columns' data types as a workaround for SPARK-13866.
>
>
> --
> Ruslan Dautkhanov
>


df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Hello,

Looking at
https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html

and can't wrap my head around how to convert string data type names to
actual pyspark.sql.types data types?

Does pyspark.sql.types have an interface to return StringType() for "string",
IntegerType() for "integer", etc.? If it doesn't exist, it would be great to
have such a mapping function.
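
In the meantime, a rough mapping along these lines (my own sketch, not an
official API) seems to be the simplest workaround on 1.5: translate the
simpleString() names that df.dtypes returns back to pyspark.sql.types
instances, falling back to StringType for anything not listed (complex types
like decimal(p,s) or array<...> would need extra handling).

# Workaround sketch only; the alias table is hand-maintained.
from pyspark.sql.types import (StructType, StringType, LongType, IntegerType,
                               DoubleType, FloatType, BooleanType, TimestampType)

_dtype_map = {
    'string': StringType(),
    'bigint': LongType(),
    'int': IntegerType(),
    'double': DoubleType(),
    'float': FloatType(),
    'boolean': BooleanType(),
    'timestamp': TimestampType(),
}

def to_spark_type(dtype_name):
    # fall back to StringType for anything not in the alias table
    return _dtype_map.get(dtype_name, StringType())

customSchema = StructType()
for col, typ in tsp_orig.dtypes:   # tsp_orig: the DataFrame from this thread
    if col == 'Agility_GAID':
        typ = 'string'
    customSchema.add(col, to_spark_type(typ), True)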

Thank you.


ps. I have a data frame, and use its dtypes to loop through all columns to
fix a few
columns' data types as a workaround for SPARK-13866.


-- 
Ruslan Dautkhanov


Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can't be found

2016-03-11 Thread Ruslan Dautkhanov
Spark session dies out after ~40 hours when running against Hadoop Secure
cluster.

spark-submit has --principal and --keytab so kerberos ticket renewal works
fine according to logs.

Does something happen with the HDFS connection?

These messages come up every 1 second:
  See complete stack: http://pastebin.com/QxcQvpqm

16/03/11 16:04:59 WARN hdfs.LeaseRenewer: Failed to renew lease for
> [DFSClient_NONMAPREDUCE_1534318438_13] for 2802 seconds.  Will retry
> shortly ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in
> cache


Then in 1 hour it stops trying:

16/03/11 16:18:17 WARN hdfs.DFSClient: Failed to renew lease for
> DFSClient_NONMAPREDUCE_1534318438_13 for 3600 seconds (>= hard-limit =3600
> seconds.) Closing all files being written ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in
> cache


It doesn't look like it is a Kerberos principal ticket renewal problem, because
that would expire much sooner (by default we have 12 hours), and from the
logs the Spark Kerberos ticket renewer works fine.

Is it some other HDFS delegation token renewal process that breaks?

RHEL 6.7
> Spark 1.5
> Hadoop 2.6


Found HDFS-5322 and YARN-2648, which seem relevant, but I am not sure if it's
the same problem.
It seems to be a Spark problem, as I have only seen it in Spark.
This is a reproducible problem: just wait ~40 hours and a Spark session
is no good.


Thanks,
Ruslan


binary file deserialization

2016-03-09 Thread Ruslan Dautkhanov
We have a huge binary file in a custom serialization format (e.g. a header
tells the length of the record, then there is a varying number of items for
that record). This is produced by an old C++ application.
What would be the best approach to deserialize it into a Hive table or a Spark
RDD?
The format is known and well documented.
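
One possible approach (a rough PySpark sketch of mine; the 4-byte length
header and the item layout below are made up, since the real format isn't
shown here): read the file with sc.binaryFiles() and walk the length-prefixed
records with struct.unpack, then turn the result into rows that can be saved
as a table. Note that binaryFiles() reads each file in one piece, so a single
huge file may first need to be split.

# Hedged sketch only: header/item layout are placeholders for the documented
# format.
import struct

def parse_records(file_and_bytes):
    _, data = file_and_bytes
    offset, records = 0, []
    while offset < len(data):
        (rec_len,) = struct.unpack_from('<I', data, offset)  # header: record length
        offset += 4
        payload = data[offset:offset + rec_len]
        offset += rec_len
        # decode the varying number of items, e.g. 8-byte doubles:
        items = struct.unpack('<%dd' % (len(payload) // 8), payload)
        records.append(items)
    return records

raw = sc.binaryFiles('/data/legacy/file.bin')   # RDD of (path, bytes)
rows = raw.flatMap(parse_records)
# rows can then be mapped to Row objects and registered as a Hive table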


-- 
Ruslan Dautkhanov


Re: Spark + Sentry + Kerberos don't add up?

2016-02-24 Thread Ruslan Dautkhanov
Turns out to be a Spark issue:

https://issues.apache.org/jira/browse/SPARK-13478




-- 
Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Hi Romain,
>
> Thank you for your response.
>
> Adding Kerberos support might be as simple as
> https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal
> and --keytab parameters to be passed to spark-submit.
>
> As a workaround I just did kinit (using hue's keytab) and then launched
> Livy Server. It probably will work as long as the kerberos ticket doesn't
> expire. That's why it would be great to have support for --principal and
> --keytab parameters for spark-submit as explained in
> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html
>
>
> The only problem I have currently is the above error stack in my previous
> email:
>
> The Spark session could not be created in the cluster:
>> at org.apache.hadoop.security.*UserGroupInformation.doAs*(
>> UserGroupInformation.java:1671)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
>> SparkSubmit.scala:160)
>
>
>
> >> AFAIK Hive impersonation should be turned off when using Sentry
>
> Yep, exactly. That's what I did. It is disabled now. But looks like on
> other hand, Spark or Spark Notebook want to have that enabled?
> It tries to do org.apache.hadoop.security.UserGroupInformation.doAs()
> hence the error.
>
> So Sentry isn't compatible with Spark in kerberized clusters? Is any
> workaround for this problem?
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux <rom...@cloudera.com>
> wrote:
>
>> Livy does not support any Kerberos yet
>> https://issues.cloudera.org/browse/LIVY-3
>>
>> Are you focusing instead about HS2 + Kerberos with Sentry?
>>
>> AFAIK Hive impersonation should be turned off when using Sentry:
>> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html
>>
>> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov <dautkha...@gmail.com
>> > wrote:
>>
>>> Getting following error stack
>>>
>>> The Spark session could not be created in the cluster:
>>>> at org.apache.hadoop.security.*UserGroupInformation.doAs*
>>>> (UserGroupInformation.java:1671)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
>>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
>>>> .open(HiveMetaStoreClient.java:466)
>>>> at
>>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
>>>> at
>>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>>>> ... 35 more
>>>
>>>
>>> My understanding that hive.server2.enable.impersonation and
>>> hive.server2.enable.doAs should be enabled to make
>>> UserGroupInformation.doAs() work?
>>>
>>> When I try to enable these parameters, Cloudera Manager shows error
>>>
>>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
>>>> (hostname)'.
>>>> Hive Impersonation should be disabled to enable Hive authorization
>>>> using Sentry
>>>
>>>
>>> So Spark-Hive conflicts with Sentry!?
>>>
>>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master).
>>> CDH 5.5.
>>>
>>> This is a kerberized cluster with Sentry.
>>>
>>> I was using hue's keytab as hue user is normally (by default in CDH) is
>>> allowed to impersonate to other users.
>>> So very convenient for Spark Notebooks.
>>>
>>> Any information to help solve this will be highly appreciated.
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Hue-Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to hue-user+unsubscr...@cloudera.org.
>>>
>>
>>
>


spark.storage.memoryFraction for shuffle-only jobs

2016-02-04 Thread Ruslan Dautkhanov
For a Spark job that only does shuffling
(e.g. Spark SQL with joins, group bys, analytical functions, order bys),
but no explicit persistent RDDs nor dataframes (there are no .cache()es in
the code),
what would be the lowest recommended setting
for spark.storage.memoryFraction?

spark.storage.memoryFraction defaults to 0.6 which is quite huge for
shuffle-only jobs.
spark.shuffle.memoryFraction defaults to 0.2 in Spark 1.5.0.

Can I set spark.storage.memoryFraction to something low like 0.1 or even
lower?
And spark.shuffle.memoryFraction to something large like 0.9? or perhaps
even 1.0?
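
For reference, a sketch of how these would be set (the 0.1 / 0.9 split below
is just the value being asked about, not a recommendation; both settings
became legacy with the unified memory manager in later releases):

# Sketch only; values are the ones from the question above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-heavy-job")
        .set("spark.storage.memoryFraction", "0.1")
        .set("spark.shuffle.memoryFraction", "0.9"))
sc = SparkContext(conf=conf)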


Thanks!


Re: Hive on Spark knobs

2016-01-29 Thread Ruslan Dautkhanov
Yep, I tried that. It seems you're right. Got an error that the execution
engine has to be set to mr:

hive.execution.engine = mr

I did not keep the exact error message/stack. It's probably disabled explicitly.


-- 
Ruslan Dautkhanov

On Thu, Jan 28, 2016 at 7:03 AM, Todd <bit1...@163.com> wrote:

> Did you run hive on spark with spark 1.5 and hive 1.1?
> I think hive on spark doesn't support spark 1.5. There are compatibility
> issues.
>
>
> At 2016-01-28 01:51:43, "Ruslan Dautkhanov" <dautkha...@gmail.com> wrote:
>
>
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> There are quite a lot of knobs to tune for Hive on Spark.
>
> Above page recommends following settings:
>
> mapreduce.input.fileinputformat.split.maxsize=75000
>> hive.vectorized.execution.enabled=true
>> hive.cbo.enable=true
>> hive.optimize.reducededuplication.min.reducer=4
>> hive.optimize.reducededuplication=true
>> hive.orc.splits.include.file.footer=false
>> hive.merge.mapfiles=true
>> hive.merge.sparkfiles=false
>> hive.merge.smallfiles.avgsize=1600
>> hive.merge.size.per.task=25600
>> hive.merge.orcfile.stripe.level=true
>> hive.auto.convert.join=true
>> hive.auto.convert.join.noconditionaltask=true
>> hive.auto.convert.join.noconditionaltask.size=894435328
>> hive.optimize.bucketmapjoin.sortedmerge=false
>> hive.map.aggr.hash.percentmemory=0.5
>> hive.map.aggr=true
>> hive.optimize.sort.dynamic.partition=false
>> hive.stats.autogather=true
>> hive.stats.fetch.column.stats=true
>> hive.vectorized.execution.reduce.enabled=false
>> hive.vectorized.groupby.checkinterval=4096
>> hive.vectorized.groupby.flush.percent=0.1
>> hive.compute.query.using.stats=true
>> hive.limit.pushdown.memory.usage=0.4
>> hive.optimize.index.filter=true
>> hive.exec.reducers.bytes.per.reducer=67108864
>> hive.smbjoin.cache.rows=1
>> hive.exec.orc.default.stripe.size=67108864
>> hive.fetch.task.conversion=more
>> hive.fetch.task.conversion.threshold=1073741824
>> hive.fetch.task.aggr=false
>> mapreduce.input.fileinputformat.list-status.num-threads=5
>> spark.kryo.referenceTracking=false
>>
>> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
>
>
> Did it work for everybody? It may take days if not weeks to try to tune
> all of these parameters for a specific job.
>
> We're on Spark 1.5 / Hive 1.1.
>
>
> ps. We have a job that can't get working well as a Hive job, so thought to
> use Hive on Spark instead. (a 3-table full outer joins with group by +
> collect_list). Spark should handle this much better.
>
>
> Ruslan
>
>
>


Hive on Spark knobs

2016-01-27 Thread Ruslan Dautkhanov
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

There are quite a lot of knobs to tune for Hive on Spark.

Above page recommends following settings:

mapreduce.input.fileinputformat.split.maxsize=75000
> hive.vectorized.execution.enabled=true
> hive.cbo.enable=true
> hive.optimize.reducededuplication.min.reducer=4
> hive.optimize.reducededuplication=true
> hive.orc.splits.include.file.footer=false
> hive.merge.mapfiles=true
> hive.merge.sparkfiles=false
> hive.merge.smallfiles.avgsize=1600
> hive.merge.size.per.task=25600
> hive.merge.orcfile.stripe.level=true
> hive.auto.convert.join=true
> hive.auto.convert.join.noconditionaltask=true
> hive.auto.convert.join.noconditionaltask.size=894435328
> hive.optimize.bucketmapjoin.sortedmerge=false
> hive.map.aggr.hash.percentmemory=0.5
> hive.map.aggr=true
> hive.optimize.sort.dynamic.partition=false
> hive.stats.autogather=true
> hive.stats.fetch.column.stats=true
> hive.vectorized.execution.reduce.enabled=false
> hive.vectorized.groupby.checkinterval=4096
> hive.vectorized.groupby.flush.percent=0.1
> hive.compute.query.using.stats=true
> hive.limit.pushdown.memory.usage=0.4
> hive.optimize.index.filter=true
> hive.exec.reducers.bytes.per.reducer=67108864
> hive.smbjoin.cache.rows=1
> hive.exec.orc.default.stripe.size=67108864
> hive.fetch.task.conversion=more
> hive.fetch.task.conversion.threshold=1073741824
> hive.fetch.task.aggr=false
> mapreduce.input.fileinputformat.list-status.num-threads=5
> spark.kryo.referenceTracking=false
>
> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch


Did it work for everybody? It may take days if not weeks to try to tune all
of these parameters for a specific job.

We're on Spark 1.5 / Hive 1.1.


ps. We have a job that we can't get to work well as a Hive job, so we thought
to use Hive on Spark instead (a 3-table full outer join with group by +
collect_list). Spark should handle this much better.


Ruslan


Re: Spark + Sentry + Kerberos don't add up?

2016-01-20 Thread Ruslan Dautkhanov
I took the liberty of creating a JIRA: https://github.com/cloudera/livy/issues/36
Feel free to close it if it doesn't belong to the Livy project.
I really don't know if this is a Spark or a Livy/Sentry problem.

Any ideas for possible workarounds?

Thank you.



-- 
Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Hi Romain,
>
> Thank you for your response.
>
> Adding Kerberos support might be as simple as
> https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal
> and --keytab parameters to be passed to spark-submit.
>
> As a workaround I just did kinit (using hue's keytab) and then launched
> Livy Server. It probably will work as long as the kerberos ticket doesn't
> expire. That's why it would be great to have support for --principal and
> --keytab parameters for spark-submit as explained in
> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html
>
>
> The only problem I have currently is the above error stack in my previous
> email:
>
> The Spark session could not be created in the cluster:
>> at org.apache.hadoop.security.*UserGroupInformation.doAs*(
>> UserGroupInformation.java:1671)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
>> SparkSubmit.scala:160)
>
>
>
> >> AFAIK Hive impersonation should be turned off when using Sentry
>
> Yep, exactly. That's what I did. It is disabled now. But looks like on
> other hand, Spark or Spark Notebook want to have that enabled?
> It tries to do org.apache.hadoop.security.UserGroupInformation.doAs()
> hence the error.
>
> So Sentry isn't compatible with Spark in kerberized clusters? Is any
> workaround for this problem?
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux <rom...@cloudera.com>
> wrote:
>
>> Livy does not support any Kerberos yet
>> https://issues.cloudera.org/browse/LIVY-3
>>
>> Are you focusing instead about HS2 + Kerberos with Sentry?
>>
>> AFAIK Hive impersonation should be turned off when using Sentry:
>> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html
>>
>> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov <dautkha...@gmail.com
>> > wrote:
>>
>>> Getting following error stack
>>>
>>> The Spark session could not be created in the cluster:
>>>> at org.apache.hadoop.security.*UserGroupInformation.doAs*
>>>> (UserGroupInformation.java:1671)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
>>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
>>>> .open(HiveMetaStoreClient.java:466)
>>>> at
>>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
>>>> at
>>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>>>> ... 35 more
>>>
>>>
>>> My understanding that hive.server2.enable.impersonation and
>>> hive.server2.enable.doAs should be enabled to make
>>> UserGroupInformation.doAs() work?
>>>
>>> When I try to enable these parameters, Cloudera Manager shows error
>>>
>>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
>>>> (hostname)'.
>>>> Hive Impersonation should be disabled to enable Hive authorization
>>>> using Sentry
>>>
>>>
>>> So Spark-Hive conflicts with Sentry!?
>>>
>>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master).
>>> CDH 5.5.
>>>
>>> This is a kerberized cluster with Sentry.
>>>
>>> I was using hue's keytab as hue user is normally (by default in CDH) is
>>> allowed to impersonate to other users.
>>> So very convenient for Spark Notebooks.
>>>
>>> Any information to help solve this will be highly appreciated.
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Hue-Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to hue-user+unsubscr...@cloudera.org.
>>>
>>
>>
>


Re: Spark + Sentry + Kerberos don't add up?

2016-01-18 Thread Ruslan Dautkhanov
Hi Romain,

Thank you for your response.

Adding Kerberos support might be as simple as
https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal and
--keytab parameters to be passed to spark-submit.

As a workaround I just did kinit (using hue's keytab) and then launched
Livy Server. It probably will work as long as the Kerberos ticket doesn't
expire. That's why it would be great to have support for --principal and
--keytab parameters for spark-submit, as explained in
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html


The only problem I have currently is the above error stack in my previous
email:

The Spark session could not be created in the cluster:
> at org.apache.hadoop.security.*UserGroupInformation.doAs*(
> UserGroupInformation.java:1671)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
> SparkSubmit.scala:160)



>> AFAIK Hive impersonation should be turned off when using Sentry

Yep, exactly. That's what I did. It is disabled now. But it looks like, on the
other hand, Spark or Spark Notebook wants to have that enabled?
It tries to do org.apache.hadoop.security.UserGroupInformation.doAs(), hence
the error.

So Sentry isn't compatible with Spark in kerberized clusters? Is there any
workaround for this problem?


-- 
Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux <rom...@cloudera.com> wrote:

> Livy does not support any Kerberos yet
> https://issues.cloudera.org/browse/LIVY-3
>
> Are you focusing instead about HS2 + Kerberos with Sentry?
>
> AFAIK Hive impersonation should be turned off when using Sentry:
> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html
>
> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> Getting following error stack
>>
>> The Spark session could not be created in the cluster:
>>> at org.apache.hadoop.security.*UserGroupInformation.doAs*
>>> (UserGroupInformation.java:1671)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
>>> .open(HiveMetaStoreClient.java:466)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>>> ... 35 more
>>
>>
>> My understanding that hive.server2.enable.impersonation and
>> hive.server2.enable.doAs should be enabled to make
>> UserGroupInformation.doAs() work?
>>
>> When I try to enable these parameters, Cloudera Manager shows error
>>
>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
>>> (hostname)'.
>>> Hive Impersonation should be disabled to enable Hive authorization using
>>> Sentry
>>
>>
>> So Spark-Hive conflicts with Sentry!?
>>
>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master).
>> CDH 5.5.
>>
>> This is a kerberized cluster with Sentry.
>>
>> I was using hue's keytab as hue user is normally (by default in CDH) is
>> allowed to impersonate to other users.
>> So very convenient for Spark Notebooks.
>>
>> Any information to help solve this will be highly appreciated.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Hue-Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to hue-user+unsubscr...@cloudera.org.
>>
>
>


Spark + Sentry + Kerberos don't add up?

2016-01-17 Thread Ruslan Dautkhanov
Getting following error stack

The Spark session could not be created in the cluster:
> at org.apache.hadoop.security.*UserGroupInformation.doAs*
> (UserGroupInformation.java:1671)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
> .open(HiveMetaStoreClient.java:466)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
> at
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> ... 35 more


My understanding is that hive.server2.enable.impersonation and
hive.server2.enable.doAs should be enabled to make
UserGroupInformation.doAs() work?

When I try to enable these parameters, Cloudera Manager shows error

Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
> (hostname)'.
> Hive Impersonation should be disabled to enable Hive authorization using
> Sentry


So Spark-Hive conflicts with Sentry!?

Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). CDH
5.5.

This is a kerberized cluster with Sentry.

I was using hue's keytab, as the hue user is normally (by default in CDH)
allowed to impersonate other users.
So it is very convenient for Spark Notebooks.

Any information to help solve this will be highly appreciated.


-- 
Ruslan Dautkhanov


livy test problem: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project livy-spark_2.10: There are test failures

2016-01-14 Thread Ruslan Dautkhanov
The Livy build test from master fails with the problem below; I can't track it
down.

YARN shows the Livy Spark YARN application as running, although an attempt to
connect to the application master shows connection refused:

HTTP ERROR 500
> Problem accessing /proxy/application_1448640910222_0046/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)


Not sure if the Livy server has an application master UI?

CDH 5.5.1.

Below is the mvn test output footer:



> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] livy-main .. SUCCESS [
>  1.299 s]
> [INFO] livy-api_2.10 .. SUCCESS [
>  3.622 s]
> [INFO] livy-client-common_2.10  SUCCESS [
>  0.862 s]
> [INFO] livy-client-local_2.10 . SUCCESS [
> 23.866 s]
> [INFO] livy-core_2.10 . SUCCESS [
>  0.316 s]
> [INFO] livy-repl_2.10 . SUCCESS [01:00
> min]
> [INFO] livy-yarn_2.10 . SUCCESS [
>  0.215 s]
> [INFO] livy-spark_2.10  FAILURE [
> 17.382 s]
> [INFO] livy-server_2.10 ... SKIPPED
> [INFO] livy-assembly_2.10 . SKIPPED
> [INFO] livy-client-http_2.10 .. SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 01:48 min
> [INFO] Finished at: 2016-01-14T14:34:28-07:00
> [INFO] Final Memory: 27M/453M
> [INFO]
> 
> [ERROR] Failed to execute goal
> org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
> livy-spark_2.10: There are test failures -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
> livy-spark_2.10: There are test failures
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.MojoFailureException: There are test
> failures
> at org.scalatest.tools.maven.TestMojo.execute(TestMojo.java:107)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> [ERROR]
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
Try Phoenix from the Cloudera parcel distribution:

https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/

It may have better Kerberos support.

On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Yes, its a kerberized cluster and ticket was generated using kinit command
> before running spark job. That's why Spark on hbase worked but when phoenix
> is used to get the connection to hbase, it does not pass the authentication
> to all nodes. Probably it is not handled in Phoenix version 4.3 or Spark
> 1.3.1 does not provide integration with Phoenix for kerberized cluster.
>
> Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured
> cluster or not?
>
> Thanks,
> Akhilesh
>
> On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> That error is not directly related to spark nor hbase
>>
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]
>>
>> Is this a kerberized cluster? You likely don't have a good (non-expired)
>> kerberos ticket for authentication to pass.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
>> pathodia.akhil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running spark job on yarn in cluster mode in secured cluster. I am
>>> trying to run Spark on Hbase using Phoenix, but Spark executors are
>>> unable to get hbase connection using phoenix. I am running knit command to
>>> get the ticket before starting the job and also keytab file and principal
>>> are correctly specified in connection URL. But still spark job on each node
>>> throws below error:
>>>
>>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication
>>> failed. The most likely cause is missing or invalid credentials. Consider
>>> 'kinit'.
>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>> GSSException: No valid credentials provided (Mechanism level: Failed to
>>> find any Kerberos tgt)]
>>> at
>>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>>
>>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
>>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
>>> this link:
>>>
>>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>>>
>>> Also, I found that there is a known issue for yarn-cluster mode for
>>> Spark 1.3.1 version:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6918
>>>
>>> Has anybody been successful in running Spark on hbase using Phoenix in
>>> yarn cluster or client mode?
>>>
>>> Thanks,
>>> Akhilesh Pathodia
>>>
>>
>>
>


Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
How many reducers did you have that created those Avro files?
Each reducer very likely creates its own Avro part-file.

We normally use Parquet, but it should be the same for Avro, so this might be
relevant:
http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289
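
One way to end up with fewer, larger part-files is to reduce the number of
partitions just before writing. A rough sketch (the partition count is a
placeholder, and the spark-avro data source is assumed, as in your example):

# Sketch: coalesce before writing so the output folder holds ~16 part-*.avro
# files instead of thousands.
combined = originalDF.unionAll(stageDF)
(combined
 .coalesce(16)
 .write
 .format("com.databricks.spark.avro")
 .save("masterNew.avro"))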




-- 
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 11:27 AM, Test One <t...@cksworks.com> wrote:

> I'm using spark-avro with SparkSQL to process and output avro files. My
> data has the following schema:
>
> root
>  |-- memberUuid: string (nullable = true)
>  |-- communityUuid: string (nullable = true)
>  |-- email: string (nullable = true)
>  |-- firstName: string (nullable = true)
>  |-- lastName: string (nullable = true)
>  |-- username: string (nullable = true)
>  |-- profiles: map (nullable = true)
>  ||-- key: string
>  ||-- value: string (valueContainsNull = true)
>
>
> When I write the file output as such with:
> originalDF.write.avro("masterNew.avro")
>
> The output location is a folder with masterNew.avro and with many many
> files like these:
> -rw-r--r--   1 kcsham  access_bpf 8 Dec  2 11:37 ._SUCCESS.crc
> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
> .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
> .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
> .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--   1 kcsham  access_bpf 0 Dec  2 11:37 _SUCCESS
> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
> part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
> part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
> part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro
>
>
> Where there are ~10 record, it has ~28000 files in that folder. When I
> simply want to copy the same dataset to a new location as an exercise from
> a local master, it takes long long time and having errors like such as
> well.
>
> 22:01:44.247 [Executor task launch worker-21] WARN
>  org.apache.spark.storage.MemoryStore - Not enough space to cache
> rdd_112058_10705 in memory! (computed 496.0 B so far)
> 22:01:44.247 [Executor task launch worker-21] WARN
>  org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
> disk instead.
> [Stage 0:===>   (10706 + 1) /
> 28014]22:01:44.574 [Executor task launch worker-21] WARN
>  org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
> threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>
>
> I'm attributing that there are way too many files to manipulate. The
> questions:
>
> 1. Is this the expected format of the avro file written by spark-avro? and
> each 'part-' is not more than 4k?
> 2. My use case is to append new records to the existing dataset using:
> originalDF.unionAll(stageDF).write.avro(masterNew)
> Any sqlconf, sparkconf that I should set to allow this to work?
>
>
> Thanks,
> kc
>
>
>


Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
That error is not directly related to spark nor hbase

javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]

Is this a kerberized cluster? You likely don't have a good (non-expired)
kerberos ticket for authentication to pass.


-- 
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Hi,
>
> I am running spark job on yarn in cluster mode in secured cluster. I am
> trying to run Spark on Hbase using Phoenix, but Spark executors are
> unable to get hbase connection using phoenix. I am running knit command to
> get the ticket before starting the job and also keytab file and principal
> are correctly specified in connection URL. But still spark job on each node
> throws below error:
>
> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
> this link:
>
> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>
> Also, I found that there is a known issue for yarn-cluster mode for Spark
> 1.3.1 version:
>
> https://issues.apache.org/jira/browse/SPARK-6918
>
> Has anybody been successful in running Spark on hbase using Phoenix in
> yarn cluster or client mode?
>
> Thanks,
> Akhilesh Pathodia
>


Re: question about combining small parquet files

2015-11-26 Thread Ruslan Dautkhanov
An interesting compaction approach for small files was discussed recently:
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/


AFAIK Spark supports views too.
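
The same compaction idea expressed in Spark (a rough sketch of mine; the paths
and the partition count are placeholders, and the count should really be
derived from total input size / desired file size):

# Rough compaction sketch: rewrite the fragmented table with a controlled
# number of partitions.
df = sqlContext.read.parquet("/warehouse/my_table")
(df.coalesce(32)
   .write
   .mode("overwrite")
   .parquet("/warehouse/my_table_compacted"))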


-- 
Ruslan Dautkhanov

On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
nyigitb...@netflix.com.invalid> wrote:

> Hi Spark people,
> I have a Hive table that has a lot of small parquet files and I am
> creating a data frame out of it to do some processing, but since I have a
> large number of splits/files my job creates a lot of tasks, which I don't
> want. Basically what I want is the same functionality that Hive provides,
> that is, to combine these small input splits into larger ones by specifying
> a max split size setting. Is this currently possible with Spark?
>
> I look at coalesce() but with coalesce I can only control the number
> of output files not their sizes. And since the total input dataset size
> can vary significantly in my case, I cannot just use a fixed partition
> count as the size of each output file can get very large. I then looked for
> getting the total input size from an rdd to come up with some heuristic to
> set the partition count, but I couldn't find any ways to do it (without
> modifying the spark source).
>
> Any help is appreciated.
>
> Thanks,
> Nezih
>
> PS: this email is the same as my previous email as I learned that my
> previous email ended up as spam for many people since I sent it through
> nabble, sorry for the double post.
>


Re: Spark REST Job server feedback?

2015-11-25 Thread Ruslan Dautkhanov
Very good question. From
http://gethue.com/new-notebook-application-for-spark-sql/

"Use Livy Spark Job Server from the Hue master repository instead of CDH
(it is currently much more advanced): see build & start the latest Livy
<https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server>
"

That post is from April 2015, though, so I'm not sure if it's still accurate.




-- 
Ruslan Dautkhanov

On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar <deenar.toras...@gmail.com
> wrote:

> Hi
>
> I had the same question. Anyone having used Livy and/opr SparkJobServer,
> would like their input.
>
> Regards
> Deenar
>
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
>
>
> On 8 October 2015 at 20:39, Tim Smith <secs...@gmail.com> wrote:
>
>> I am curious too - any comparison between the two. Looks like one is
>> Datastax sponsored and the other is Cloudera. Other than that, any
>> major/core differences in design/approach?
>>
>> Thanks,
>>
>> Tim
>>
>>
>> On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal <
>> ramirezquetza...@gmail.com> wrote:
>>
>>> Anyone has feedback on using Hue / Spark Job Server REST servers?
>>>
>>>
>>> http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/
>>>
>>> https://github.com/spark-jobserver/spark-jobserver
>>>
>>> Many thanks,
>>>
>>> Rami
>>>
>>
>>
>>
>> --
>>
>> --
>> Thanks,
>>
>> Tim
>>
>
>


Re: Data in one partition after reduceByKey

2015-11-25 Thread Ruslan Dautkhanov
public long getTime()

Returns the number of milliseconds since January 1, 1970, 00:00:00 GMT
represented by this Date object.

http://docs.oracle.com/javase/7/docs/api/java/util/Date.html#getTime%28%29

Based on what you did, it might be easier to build a date partitioner from
that. Also, to get an even more even distribution, you could use a hash
function of that value, not just a remainder.
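
A quick plain-Python illustration of the idea (values are made up): keys one
minute apart, assigned to 16 partitions by hashing their epoch minute instead
of relying on a hashCode that is nearly constant for minute-truncated keys.

# Illustration only; not tied to the Scala code quoted below.
import datetime, time

start = datetime.datetime(2015, 11, 23, 11, 0)
keys = [start + datetime.timedelta(minutes=i) for i in range(8)]

def partition_of(dt, num_partitions=16):
    epoch_minutes = int(time.mktime(dt.timetuple())) // 60
    return hash(epoch_minutes) % num_partitions

print([partition_of(k) for k in keys])   # consecutive partition ids, evenly spread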




-- 
Ruslan Dautkhanov

On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin <mcgloin.patr...@gmail.com>
wrote:

> I will answer my own question, since I figured it out.  Here is my answer
> in case anyone else has the same issue.
>
> My DateTimes were all without seconds and milliseconds since I wanted to
> group data belonging to the same minute. The hashCode() for Joda DateTimes
> which are one minute apart is a constant:
>
> scala> val now = DateTime.now
> now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z
>
> scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - 
> now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
> res42: Int = 6
>
> As can be seen by this example, if the hashCode values are similarly
> spaced, they can end up in the same partition:
>
> scala> val nums = for(i <- 0 to 100) yield ((i*20 % 1000), i)
> nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), 
> (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), 
> (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), 
> (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), 
> (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), 
> (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), 
> (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), 
> (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), 
> (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), 
> (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), 
> (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
>
> scala> val rddNum = sc.parallelize(nums)
> rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at 
> parallelize at :23
>
> scala> val reducedNum = rddNum.reduceByKey(_+_)
> reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at 
> reduceByKey at :25
>
> scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, 
> true).collect.toList
>
> res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0)
>
> To distribute my data more evenly across the partitions I created my own
> custom Partitoiner:
>
> class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
>   def numPartitions: Int = rddNumPartitions
>   def getPartition(key: Any): Int = {
> key match {
>   case dateTime: DateTime =>
> val sum = dateTime.getYear + dateTime.getMonthOfYear +  
> dateTime.getDayOfMonth + dateTime.getMinuteOfDay  + dateTime.getSecondOfDay
> sum % numPartitions
>   case _ => 0
> }
>   }
> }
>
>
> On 20 November 2015 at 17:17, Patrick McGloin <mcgloin.patr...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have Spark application which contains the following segment:
>>
>> val reparitioned = rdd.repartition(16)
>> val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned, startDate, 
>> endDate)
>> val mapped: RDD[(DateTime, myData)] = filtered.map(kv=(kv._1.processingTime, 
>> kv._2))
>> val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)
>>
>> When I run this with some logging this is what I see:
>>
>> reparitioned ==> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 
>> 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
>> filtered ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 
>> 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
>> mapped ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 
>> 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
>> reduced ==> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
>>
>> My logging is done using these two lines:
>>
>> val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, 
>> true)log.info(s"rdd ==> [${sizes.collect.toList}]")
>>
>> My question is why does my data end up in one partition after the
>> reduceByKey? After the filter it can be seen that the data is evenly
>> distributed, but the reduceByKey results in data in only one partition.
>>
>> Thanks,
>>
>> Patrick
>>
>
>


Re: ISDATE Function

2015-11-18 Thread Ruslan Dautkhanov
You could write your own UDF isdate().
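For example, something along these lines in PySpark (a sketch; the accepted
date formats are placeholders):

# Sketch of an isdate() UDF: returns True when the string parses as a date in
# one of the listed formats.
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_date(value, formats=("%Y-%m-%d", "%m/%d/%Y")):
    if value is None:
        return False
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

isdate_udf = udf(is_date, BooleanType())
# usage: df.select(isdate_udf(df["some_column"]).alias("is_date"))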



-- 
Ruslan Dautkhanov

On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani <rrav...@gmail.com> wrote:

> Hi Ted Yu,
>
> Thanks for your response. Is any other way to achieve in Spark Query?
>
>
> Regards,
> Ravi
>
> On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> ISDATE() is currently not supported.
>> Since it is SQL Server specific, I guess it wouldn't be added to Spark.
>>
>> On Mon, Nov 16, 2015 at 10:46 PM, Ravisankar Mani <rrav...@gmail.com>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>>
>>> In MS SQL Server, the "ISDATE()" function is used to find out whether a
>>> column value is a date or not. Is it possible to check whether column
>>> values are dates?
>>>
>>>
>>> Regards,
>>> Ravi
>>>
>>
>>
>


Re: kerberos question

2015-11-06 Thread Ruslan Dautkhanov
Instead of specifying the --principal [principle] --keytab [keytab]
--proxy-user [proxied_user] ... arguments, you could probably just create/renew
a Kerberos ticket before submitting the job:
$ kinit principal.name -kt keytab.file
$ spark-submit ...

Do you need impersonation / proxy user at all? I thought its primary use
is for Hue and similar services, which use impersonation quite heavily in
kerberized clusters.



-- 
Ruslan Dautkhanov

On Wed, Nov 4, 2015 at 1:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0]
> hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) -
> Could not find uri with key [dfs. encryption.key.provider.uri] to
> create a keyProvider !!
>
> Could it be related to HDFS-7931 ?
>
> On Wed, Nov 4, 2015 at 12:30 PM, Chen Song <chen.song...@gmail.com> wrote:
>
>> After a bit more investigation, I found that it could be related to
>> impersonation on kerberized cluster.
>>
>> Our job is started with the following command.
>>
>> /usr/lib/spark/bin/spark-submit --master yarn-client --principal [principle] 
>> --keytab [keytab] --proxy-user [proxied_user] ...
>>
>>
>> In application master's log,
>>
>> At start up,
>>
>> 2015-11-03 16:03:41,602 INFO  [main] yarn.AMDelegationTokenRenewer 
>> (Logging.scala:logInfo(59)) - Scheduling login from keytab in 64789744 
>> millis.
>>
>> Later on, when the delegation token renewer thread kicks in, it tries to
>> re-login with the specified principle with new credentials and tries to
>> write the new credentials into the over to the directory where the current
>> user's credentials are stored. However, with impersonation, because the
>> current user is a different user from the principle user, it fails with
>> permission error.
>>
>> 2015-11-04 10:03:31,366 INFO  [Delegation Token Refresh Thread-0] 
>> yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Attempting to 
>> login to KDC using principal: principal/host@domain
>> 2015-11-04 10:03:31,665 INFO  [Delegation Token Refresh Thread-0] 
>> yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Successfully 
>> logged into KDC.
>> 2015-11-04 10:03:31,702 INFO  [Delegation Token Refresh Thread-0] 
>> yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token for 
>> namenode: 
>> hdfs://hadoop_abc/user/proxied_user/.sparkStaging/application_1443481003186_0
>> 2015-11-04 10:03:31,904 INFO  [Delegation Token Refresh Thread-0] 
>> hdfs.DFSClient (DFSClient.java:getDelegationToken(1025)) - Created 
>> HDFS_DELEGATION_TOKEN token 389283 for principal on ha-hdfs:hadoop_abc
>> 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] 
>> hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - 
>> Could not find uri with key [dfs.encryption.key.provider.uri] to create a 
>> keyProvider !!
>> 2015-11-04 10:03:31,944 WARN  [Delegation Token Refresh Thread-0] 
>> security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - 
>> PriviledgedActionException as:proxy-user (auth:SIMPLE) 
>> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>> 2015-11-04 10:03:31,945 WARN  [Delegation Token Refresh Thread-0] ipc.Client 
>> (Client.java:run(675)) - Exception encountered while connecting to the 
>> server : 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>> 2015-11-04 10:03:31,945 WARN  [Delegation Token Refresh Thread-0] 
>> security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - 
>> PriviledgedActionException as:proxy-user (auth:SIMPLE) 
>> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>> 2015-11-04 10:03:31,963 WARN  [Delegation Token Refresh Thread-0] 
>> yarn.YarnSparkHadoopUtil (Logging.scala:logWarning(92)) - Error while 
>> attempting to list files from application staging dir
>> org.apache.hadoop.security.AccessControlException: Permission denied: 
>> user=principal, access=READ_EXECUTE, 
>> inode="/user/proxy-user/.sparkStaging/application_1443481003186_0":proxy-user:proxy-user:drwx--
>>
>>
>> Can someone confirm my understanding is right? The class relevant is
>> below,
>> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala
>>
>> 

Re: Pivot Data in Spark and Scala

2015-10-30 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-8992

Should be in 1.6?
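
For reference, with the DataFrame pivot from SPARK-8992 this would look
roughly like the sketch below (PySpark shown; the Scala API is analogous, and
the column names are assumptions):

# Sketch of pivoting the sample data with groupBy().pivot().
df = sqlContext.createDataFrame(
    [("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
     ("B", 2015, 24), ("B", 2013, 4)],
    ["name", "year", "value"])

pivoted = df.groupBy("name").pivot("year").sum("value")
pivoted.show()
# +----+----+----+----+
# |name|2013|2014|2015|
# +----+----+----+----+
# |   A|   1|  12|   4|
# |   B|   4|null|  24|
# +----+----+----+----+
# (row order may vary)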



-- 
Ruslan Dautkhanov

On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss <ascot.m...@gmail.com> wrote:

> Hi,
>
> I have data as follows:
>
> A, 2015, 4
> A, 2014, 12
> A, 2013, 1
> B, 2015, 24
> B, 2013 4
>
>
> I need to convert the data to a new format:
> A ,4,12,1
> B,   24,,4
>
> Any idea how to make it in Spark Scala?
>
> Thanks
>
>


save DF to JDBC

2015-10-05 Thread Ruslan Dautkhanov
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases


Spark JDBC can read data from JDBC, but can it save back to JDBC?
For example, to an Oracle database through its JDBC driver.

I also looked at the SQLContext documentation
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SQLContext.html
and can't find anything relevant.
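
For what it's worth, the DataFrameWriter (Spark 1.4+) does support this; a
sketch, with the Oracle URL and credentials as placeholders:

# Sketch only; connection details are placeholders.
props = {"user": "scott", "password": "tiger",
         "driver": "oracle.jdbc.OracleDriver"}
df.write.jdbc(url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
              table="target_table",
              mode="append",
              properties=props)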

Thanks!


-- 
Ruslan Dautkhanov


Re: save DF to JDBC

2015-10-05 Thread Ruslan Dautkhanov
Thank you Richard and Matthew.

DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned
earlier, we're on CDH 5.4 / Spark 1.3. No options for this version?


Best regards,
Ruslan Dautkhanov

On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas <rhil...@us.ibm.com> wrote:

> Hi Ruslan,
>
> Here is some sample code which writes a DataFrame to a table in a Derby
> database:
>
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val binaryVal = Array[Byte] ( 1, 2, 3, 4 )
> val timestampVal = java.sql.Timestamp.valueOf("1996-01-01 03:30:36")
> val dateVal = java.sql.Date.valueOf("1996-01-01")
>
> val allTypes = sc.parallelize(
> Array(
>   (1,
>   1.toLong,
>   1.toDouble,
>   1.toFloat,
>   1.toShort,
>   1.toByte,
>   "true".toBoolean,
>   "one ring to rule them all",
>   binaryVal,
>   timestampVal,
>   dateVal,
>   BigDecimal.valueOf(42549.12)
>   )
> )).toDF(
>   "int_col",
>   "long_col",
>   "double_col",
>   "float_col",
>   "short_col",
>   "byte_col",
>   "boolean_col",
>   "string_col",
>   "binary_col",
>   "timestamp_col",
>   "date_col",
>   "decimal_col"
>   )
>
> val properties = new java.util.Properties()
>
> allTypes.write.jdbc("jdbc:derby:/Users/rhillegas/derby/databases/derby1",
> "all_spark_types", properties)
>
> Hope this helps,
>
> Rick Hillegas
> STSM, IBM Analytics, Platform - IBM USA
>
>
> Ruslan Dautkhanov <dautkha...@gmail.com> wrote on 10/05/2015 02:44:20 PM:
>
> > From: Ruslan Dautkhanov <dautkha...@gmail.com>
> > To: user <user@spark.apache.org>
> > Date: 10/05/2015 02:45 PM
> > Subject: save DF to JDBC
> >
> > http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-
> > to-other-databases
> >
> > Spark JDBC can read data from JDBC, but can it save back to JDBC?
> > Like to an Oracle database through its jdbc driver.
> >
> > Also looked at SQL Context documentation
> > https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/
> > SQLContext.html
> > and can't find anything relevant.
> >
> > Thanks!
> >
> >
> > --
> > Ruslan Dautkhanov
>


Re: Spark data type guesser UDAF

2015-09-21 Thread Ruslan Dautkhanov
Does it deserve to be a JIRA in Spark / Spark MLlib?
How do you guys normally determine data types?

Frameworks like h2o automatically determine data types by scanning a sample of
the data, or the whole dataset.
One can then decide, e.g., whether a variable should be categorical or
numerical.

Another use case is if you get an arbitrary data set (we get them quite
often), and want to save it as a Parquet table.
Providing correct data types makes Parquet more space-efficient (and probably
more query-time performant, e.g. better Parquet bloom filters than just
storing everything as string/varchar).
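
As a rough illustration of the idea (my own sketch, not the Hive UDAF from the
link above): scan a sample of string values and return the narrowest type name
that parses for all of them; the candidate order is an assumption.

# Sketch of a per-column type guesser over a sample of string values.
from datetime import datetime

def guess_type(values, date_format="%Y-%m-%d"):
    def parses(cast):
        for v in values:
            if v is None or v == "":
                continue
            try:
                cast(v)
            except (ValueError, TypeError):
                return False
        return True

    if parses(int):
        return "bigint"
    if parses(float):
        return "double"
    if parses(lambda v: datetime.strptime(v, date_format)):
        return "timestamp"
    return "string"

print(guess_type(["1", "42", ""]))          # bigint
print(guess_type(["1.5", "2"]))             # double
print(guess_type(["2015-09-21", "abc"]))    # string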



-- 
Ruslan Dautkhanov

On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Wanted to take something like this
>
> https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
> and create a Hive UDAF to create an aggregate function that returns a data
> type guess.
> Am I inventing a wheel?
> Does Spark have something like this already built-in?
> Would be very useful for new wide datasets to explore data. Would be
> helpful for ML too,
> e.g. to decide categorical vs numerical variables.
>
>
> Ruslan
>
>


Re: NGINX + Spark Web UI

2015-09-17 Thread Ruslan Dautkhanov
Similar setup for Hue
http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/

Might give you an idea.



-- 
Ruslan Dautkhanov

On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 <renato.per...@gmail.com> wrote:

> Hello!
> I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI.
> I have 2 machines:
> 1) Machine A, with a public IP. This machine will be used to access Spark
> Web UI on the Machine B through its private IP address.
> 2) Machine B, where Spark is installed (standalone master cluster, 1 worker
> node and the history server) not accessible from the outside.
>
> Basically I want to access the Spark Web UI through my Machine A using the
> URL:
> http://machine_A_ip_address/spark
>
> Currently I have this setup:
> http {
> proxy_set_header X-Real-IP $remote_addr;
> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
> proxy_set_header Host $http_host;
> proxy_set_header X-NginX-Proxy true;
> proxy_set_header X-Ssl on;
> }
>
> # Master cluster node
> upstream app_master {
> server machine_B_ip_address:8080;
> }
>
> # Slave worker node
> upstream app_worker {
> server machine_B_ip_address:8081;
> }
>
> # Job UI
> upstream app_ui {
> server machine_B_ip_address:4040;
> }
>
> # History server
> upstream app_history {
> server machine_B_ip_address:18080;
> }
>
> I'm really struggling in figuring out a correct location directive to make
> the whole thing work, not only for accessing all ports using the url /spark
> but also in making the links in the web app be transformed accordingly. Any
> idea?
>
>
> Any help really appreciated.
> Thank you in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/NGINX-Spark-Web-UI-tp24726.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark data type guesser UDAF

2015-09-17 Thread Ruslan Dautkhanov
Wanted to take something like this
https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
and create a Hive UDAF to create an aggregate function that returns a data
type guess.
Am I reinventing the wheel?
Does Spark have something like this already built-in?
Would be very useful for new wide datasets to explore data. Would be
helpful for ML too,
e.g. to decide categorical vs numerical variables.


Ruslan


Re: Spark ANN

2015-09-15 Thread Ruslan Dautkhanov
Thank you Alexander.
Sounds like quite a lot of good and exciting changes slated for Spark's ANN.
Looking forward to it.



-- 
Ruslan Dautkhanov

On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander <alexander.ula...@hpe.com>
wrote:

> Thank you, Feynman, this is helpful. The paper that I linked claims a big
> speedup with FFTs for large convolution size. Though as you mentioned there
> is no FFT transformer in Spark yet. Moreover, we would need a parallel
> version of FFTs to support batch computations. So it probably worth
> considering matrix-matrix multiplication for convolution optimization at
> least as a first version. It can also take advantage of data batches.
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Wednesday, September 09, 2015 12:56 AM
>
> *To:* Ulanov, Alexander
> *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru
> *Subject:* Re: Spark ANN
>
>
>
> My 2 cents:
>
>
>
> * There is frequency domain processing available already (e.g. spark.ml
> DCT transformer) but no FFT transformer yet because complex numbers are not
> currently a Spark SQL datatype
>
> * We shouldn't assume signals are even, so we need complex numbers to
> implement the FFT
>
> * I have not closely studied the relative performance tradeoffs, so please
> do let me know if there's a significant difference in practice
>
>
>
>
>
>
>
> On Tue, Sep 8, 2015 at 5:46 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> That is an option too. Implementing convolutions with FFTs should be
> considered as well http://arxiv.org/pdf/1312.5851.pdf.
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Tuesday, September 08, 2015 12:07 PM
> *To:* Ulanov, Alexander
> *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru
> *Subject:* Re: Spark ANN
>
>
>
> Just wondering, why do we need tensors? Is the implementation of convnets
> using im2col (see here <http://cs231n.github.io/convolutional-networks/>)
> insufficient?
>
>
>
> On Tue, Sep 8, 2015 at 11:55 AM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Ruslan, thanks for including me in the discussion!
>
>
>
> Dropout and other features such as Autoencoder were implemented, but not
> merged yet in order to have room for improving the internal Layer API. For
> example, there is an ongoing work with convolutional layer that
> consumes/outputs 2D arrays. We’ll probably need to change the Layer’s
> input/output type to tensors. This will influence dropout which will need
> some refactoring to handle tensors too. Also, all new components should
> have ML pipeline public interface. There is an umbrella issue for deep
> learning in Spark https://issues.apache.org/jira/browse/SPARK-5575 which
> includes various features of Autoencoder, in particular
> https://issues.apache.org/jira/browse/SPARK-10408. You are very welcome
> to join and contribute since there is a lot of work to be done.
>
>
>
> Best regards, Alexander
>
> *From:* Ruslan Dautkhanov [mailto:dautkha...@gmail.com]
> *Sent:* Monday, September 07, 2015 10:09 PM
> *To:* Feynman Liang
> *Cc:* Nick Pentreath; user; na...@yandex.ru
> *Subject:* Re: Spark ANN
>
>
>
> Found a dropout commit from avulanov:
>
>
> https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d
>
>
>
> It probably hasn't made its way to MLLib (yet?).
>
>
>
>
>
> --
> Ruslan Dautkhanov
>
>
>
> On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang <fli...@databricks.com>
> wrote:
>
> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on
> the roadmap for 1.6 <https://issues.apache.org/jira/browse/SPARK-10324>
> though, and there is a spark package
> <http://spark-packages.org/package/rakeshchalasani/MLlib-dropout> for
> dropout regularized logistic regression.
>
>
>
>
>
> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
> Thanks!
>
>
>
> It does not look Spark ANN yet supports dropout/dropconnect or any other
> techniques that help avoiding overfitting?
>
> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>
> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>
>
>
> ps. There is a small copy-paste typo in
>
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>
> should read B :)
>
>
>
>
>
> --
> Ruslan Dautkhanov
>
>
>
> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang <fli...@databricks.com>
> wrote:
>
> Backprop is used to comput

Re: Best way to import data from Oracle to Spark?

2015-09-10 Thread Ruslan Dautkhanov
Sathish,

Thanks for pointing to that.

https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm

That must be only part of Oracle's BDA codebase, not open-source Hive,
right?



-- 
Ruslan Dautkhanov

On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> I guess data pump export from Oracle could be fast option. Hive now has
> oracle data pump serde..
>
> https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm
>
>
>
> On Wed, Sep 9, 2015 at 4:41 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Using the JDBC data source is probably the best way.
>> http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases
>>
>> On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin <icecreamlc...@gmail.com> wrote:
>>
>>> What's the best way to import data from Oracle to Spark? Thanks!
>>>
>>>
>>> --
>>> Best regards!
>>>
>>> Lin,Cui
>>>
>>
>>


Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Ruslan Dautkhanov
Using the DataFrame API is a good workaround.
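For example, a minimal sketch in Scala (hypothetical table and column names;
the user-supplied values stay values and never become part of the SQL text):

import org.apache.spark.sql.functions.col

// values coming from the user, e.g. parsed from a request as Int
val userMinAge = 1
val userMaxAge = 3

val toddlers = sqlContext.table("people")
  .filter(col("age") >= userMinAge && col("age") <= userMaxAge)
  .select("name")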

Another way would be to use bind variables, but I don't think Spark SQL
supports them.
That's probably what Dinesh meant by "was not able to find any API for
preparing the SQL statement safely avoiding injection".

E.g.

val sql_handler = sqlContext.sql("SELECT name FROM people WHERE age >=
:var1 AND age <= :var2").parse()

toddlers = sql_handler.execute("var1"->1, "var2"->3)

teenagers = sql_handler.execute(13, 19)


SQL injection would not be possible if Spark SQL supported bind variables,
as parameters are always treated as values and never as part of the SQL
text. It's also arguably easier for developers, as you don't have to
escape/quote anything.


ps. Another advantage is that Spark could parse and create the plan once, but
execute it multiple times.
http://www.akadia.com/services/ora_bind_variables.html
This point is more relevant for OLTP-like queries, which Spark is probably
not yet good at (e.g. returning a few rows quickly, within a few ms).



-- 
Ruslan Dautkhanov

On Thu, Sep 10, 2015 at 12:07 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Either that or use the DataFrame API, which directly constructs query
> plans and thus doesn't suffer from injection attacks (and runs on the same
> execution engine).
>
> On Thu, Sep 10, 2015 at 12:10 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I don't think this is Spark-specific. Mostly you need to escape /
>> quote user-supplied values as with any SQL engine.
>>
>> On Thu, Sep 10, 2015 at 7:32 AM, V Dineshkumar
>> <developer.dines...@gmail.com> wrote:
>> > Hi,
>> >
>> > What is the preferred way of avoiding SQL Injection while using Spark
>> SQL?
>> > In our use case we have to take the parameters directly from the users
>> and
>> > prepare the SQL Statement.I was not able to find any API for preparing
>> the
>> > SQL statement safely avoiding injection.
>> >
>> > Thanks,
>> > Dinesh
>> > Philips India
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Best way to import data from Oracle to Spark?

2015-09-08 Thread Ruslan Dautkhanov
You can also sqoop oracle data in

$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl
--username MOVIEDEMO --password welcome1 --table ACTIVITY

http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/


-- 
Ruslan Dautkhanov

On Tue, Sep 8, 2015 at 11:11 AM, Cui Lin <icecreamlc...@gmail.com> wrote:

> What's the best way to import data from Oracle to Spark? Thanks!
>
>
> --
> Best regards!
>
> Lin,Cui
>


Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html

The implementation seems to be missing backpropagation?
Was there a good reason to omit BP?
What are the drawbacks of a pure feedforward-only ANN?

Thanks!


-- 
Ruslan Dautkhanov


Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Thanks!

It does not look like Spark ANN supports dropout/dropconnect or any other
techniques that help avoid overfitting yet?
http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
https://cs.nyu.edu/~wanli/dropc/dropc.pdf

ps. There is a small copy-paste typo in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
should read B :)



-- 
Ruslan Dautkhanov

On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang <fli...@databricks.com>
wrote:

> Backprop is used to compute the gradient here
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L579-L584>,
> which is then optimized by SGD or LBFGS here
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L878>
>
> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Haven't checked the actual code but that doc says "MLPC employes
>> backpropagation for learning the model. .."?
>>
>>
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
>> wrote:
>>
>>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>>
>>> Implementation seems missing backpropagation?
>>> Was there is a good reason to omit BP?
>>> What are the drawbacks of a pure feedforward-only ANN?
>>>
>>> Thanks!
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>
>>
>


Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Found a dropout commit from avulanov:
https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d

It probably hasn't made its way to MLLib (yet?).



-- 
Ruslan Dautkhanov

On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang <fli...@databricks.com> wrote:

> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on
> the roadmap for 1.6 <https://issues.apache.org/jira/browse/SPARK-10324>
> though, and there is a spark package
> <http://spark-packages.org/package/rakeshchalasani/MLlib-dropout> for
> dropout regularized logistic regression.
>
>
> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> Thanks!
>>
>> It does not look Spark ANN yet supports dropout/dropconnect or any other
>> techniques that help avoiding overfitting?
>> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>>
>> ps. There is a small copy-paste typo in
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>> should read B :)
>>
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang <fli...@databricks.com>
>> wrote:
>>
>>> Backprop is used to compute the gradient here
>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L579-L584>,
>>> which is then optimized by SGD or LBFGS here
>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L878>
>>>
>>> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Haven't checked the actual code but that doc says "MLPC employes
>>>> backpropagation for learning the model. .."?
>>>>
>>>>
>>>>
>>>> —
>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>
>>>>
>>>> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov <dautkha...@gmail.com
>>>> > wrote:
>>>>
>>>>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>>>>
>>>>> Implementation seems missing backpropagation?
>>>>> Was there is a good reason to omit BP?
>>>>> What are the drawbacks of a pure feedforward-only ANN?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> --
>>>>> Ruslan Dautkhanov
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
Read the response from Cheng Lian <lian.cs@gmail.com> on Aug 27th - it
looks like the same problem.

Workarounds:
1. write that Parquet file in Spark (a sketch below);
2. upgrade to Spark 1.5.
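A rough sketch of what workaround 1 could look like (Scala, Spark 1.4,
assuming sqlContext is a HiveContext; the table and path names are made up,
and spark.sql.hive.convertMetastoreParquet=false makes Spark read the table
through the Hive SerDe instead of its native Parquet reader):

// read the Hive-written table via the Hive SerDe, then let Spark write Parquet
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
val df = sqlContext.sql("SELECT * FROM hive_table_with_arrays")
df.write.parquet("/user/hivedata/table_rewritten_by_spark")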


--
Ruslan Dautkhanov

On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov <ale...@gmail.com> wrote:

> No, it was created in Hive by CTAS, but any help is appreciated...
>
> On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> wrote:
>
>> That parquet table wasn't created in Spark, is it?
>>
>> There was a recent discussion on this list that complex data types in
>> Spark prior to 1.5 often incompatible with Hive for example, if I remember
>> correctly.
>> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov <ale...@gmail.com> wrote:
>>
>>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>>> 1.4.1 with Hadoop 2.6):
>>>
>>> {code}
>>> $ bin/spark-shell
>>> log4j:WARN No appenders could be found for logger
>>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>> log4j:WARN Please initialize the log4j system properly.
>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>>> for more info.
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
>>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>>> users with modify permissions: Set(hivedata)
>>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
>>> server' on port 43731.
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>>>   /_/
>>>
>>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.8.0)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
>>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
>>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
>>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>>> users with modify permissions: Set(hivedata)
>>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
>>> 15/09/07 13:45:27 INFO Remoting: Starting remoting
>>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://sparkDriver@10.10.30.52:46083]
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
>>> on port 46083.
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
>>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
>>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
>>> 265.1 MB
>>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
>>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
>>> server' on port 38717.
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
>>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
>>> 4040. Attempting port 4041.
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
>>> port 4041.
>>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at
>>> http://10.10.30.52:4041
>>> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
>>> localhost
>>> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
>>> http://10.10.30.52:43731
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service
>>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
>>> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
That Parquet table wasn't created in Spark, was it?

There was a recent discussion on this list that complex data types in Spark
prior to 1.5 are often incompatible with Hive, for example, if I remember
correctly.

On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov  wrote:

> I am trying to read an (array typed) parquet file in spark-shell (Spark
> 1.4.1 with Hadoop 2.6):
>
> {code}
> $ bin/spark-shell
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
> server' on port 43731.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>   /_/
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
> 15/09/07 13:45:27 INFO Remoting: Starting remoting
> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@10.10.30.52:46083]
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
> on port 46083.
> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
> 265.1 MB
> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
> server' on port 38717.
> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
> localhost
> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
> http://10.10.30.52:43731
> 15/09/07 13:45:27 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
> localhost, 60973)
> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
> Spark context available as sc.
> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
> 0.13.1
> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
> 15/09/07 13:45:29 INFO Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
> unknown - will be ignored
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes
> with
> 

Re: Ranger-like Security on Spark

2015-09-03 Thread Ruslan Dautkhanov
You could define access in Sentry and enable permissions sync with HDFS, so
you could just grant access in Hive on a per-database or per-table basis. It
should work for Spark too, as Sentry will propagate "grants" to HDFS ACLs.

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/sg_hdfs_sentry_sync.html




-- 
Ruslan Dautkhanov

On Thu, Sep 3, 2015 at 1:46 PM, Daniel Schulz <danielschulz2...@hotmail.com>
wrote:

> Hi Matei,
>
> Thanks for your answer.
>
> My question is regarding simple authenticated Spark-on-YARN only, without
> Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my
> HDFS user and only be able to access files I am entitled to read/write?
> Will it enforce HDFS ACLs and Ranger policies as well?
>
> Best regards, Daniel.
>
> > On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >
> > If you run on YARN, you can use Kerberos, be authenticated as the right
> user, etc in the same way as MapReduce jobs.
> >
> > Matei
> >
> >> On Sep 3, 2015, at 1:37 PM, Daniel Schulz <danielschulz2...@hotmail.com>
> wrote:
> >>
> >> Hi,
> >>
> >> I really enjoy using Spark. An obstacle to sell it to our clients
> currently is the missing Kerberos-like security on a Hadoop with simple
> authentication. Are there plans, a proposal, or a project to deliver a
> Ranger plugin or something similar to Spark. The target is to differentiate
> users and their privileges when reading and writing data to HDFS? Is
> Kerberos my only option then?
> >>
> >> Kind regards, Daniel.
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: FAILED_TO_UNCOMPRESS error from Snappy

2015-08-20 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/SPARK-7660 ?


-- 
Ruslan Dautkhanov

On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio tarop...@gmail.com wrote:

 Right after upgraded to 1.4.1, we started seeing this exception and yes we
 picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there
 anything I could try ? I don't have a repro case.

 org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:127)
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:169)
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:168)
 at
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:168)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
 at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
 at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
 at
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
 at
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
 at
 org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
 at
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
 at
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1178)
 at
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301)
 at
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)
 at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
 at scala.util.Try$.apply(Try.scala:161)
 at scala.util.Success.map(Try.scala:206)
 at
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
 at
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
 ... 33 more


 --
 Kohki Nishio



Re: Spark Master HA on YARN

2015-08-16 Thread Ruslan Dautkhanov
There is no Spark master in YARN mode; that's standalone-mode terminology.
In yarn-cluster mode, Spark's Application Master (the Spark driver runs
inside it) will be restarted automatically by the ResourceManager, up to
yarn.resourcemanager.am.max-retries times (default is 2).
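That retry limit is a YARN setting rather than a Spark one, e.g. in
yarn-site.xml (newer Hadoop versions call it
yarn.resourcemanager.am.max-attempts):

<property>
  <name>yarn.resourcemanager.am.max-retries</name>
  <value>4</value>
</property>

If I recall correctly, newer Spark versions can also cap it per application
with spark.yarn.maxAppAttempts.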

--
Ruslan Dautkhanov

On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta bhas...@gmail.com wrote:

 Hi,

 Is Spark master high availability supported on YARN (yarn-client mode)
 analogous to
 https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
 ?

 Thanks
 Bhaskie



Re: Spark job workflow engine recommendations

2015-08-11 Thread Ruslan Dautkhanov
We use Talend, but not for Spark workflows, although it does have Spark
components.

https://www.talend.com/download/talend-open-studio
It is free (commercial support is available) and makes it easy to design and
deploy workflows.
Talend for Big Data 6.0 was released about a month ago.

Is anybody using Talend for Spark?



-- 
Ruslan Dautkhanov

On Tue, Aug 11, 2015 at 11:30 AM, Hien Luu h...@linkedin.com.invalid
wrote:

 We are in the middle of figuring that out.  At the high level, we want to
 combine the best parts of existing workflow solutions.

 On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone vikramk...@gmail.com wrote:

 Hien,
 Is Azkaban being phased out at linkedin as rumored? If so, what's
 linkedin going to use for workflow scheduling? Is there something else
 that's going to replace Azkaban?

 On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu yuzhih...@gmail.com wrote:

 In my opinion, choosing some particular project among its peers should
 leave enough room for future growth (which may come faster than you
 initially think).

 Cheers

 On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu h...@linkedin.com wrote:

 Scalability is a known issue due the the current architecture.  However
 this will be applicable if you run more 20K jobs per day.

 On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu yuzhih...@gmail.com wrote:

 From what I heard (an ex-coworker who is Oozie committer), Azkaban is
 being phased out at LinkedIn because of scalability issues (though 
 UI-wise,
 Azkaban seems better).

 Vikram:
 I suggest you do more research in related projects (maybe using their
 mailing lists).

 Disclaimer: I don't work for LinkedIn.

 On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 Hi Vikram,

 We use Azkaban (2.5.0) in our production workflow scheduling. We just
 use local mode deployment and it is fairly easy to set up. It is pretty
 easy to use and has a nice scheduling and logging interface, as well as
 SLAs (like kill job and notify if it doesn't complete in 3 hours or
 whatever).

 However Spark support is not present directly - we run everything
 with shell scripts and spark-submit. There is a plugin interface where 
 one
 could create a Spark plugin, but I found it very cumbersome when I did
 investigate and didn't have the time to work through it to develop that.

 It has some quirks and while there is actually a REST API for adding
 jos and dynamically scheduling jobs, it is not documented anywhere so you
 kinda have to figure it out for yourself. But in terms of ease of use I
 found it way better than Oozie. I haven't tried Chronos, and it seemed
 quite involved to set up. Haven't tried Luigi either.

 Spark job server is good but as you say lacks some stuff like
 scheduling and DAG type workflows (independent of spark-defined job 
 flows).


 On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke jornfra...@gmail.com
 wrote:

 Check also falcon in combination with oozie

 Le ven. 7 août 2015 à 17:51, Hien Luu h...@linkedin.com.invalid a
 écrit :

 Looks like Oozie can satisfy most of your requirements.



 On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com
 wrote:

 Hi,
 I'm looking for open source workflow tools/engines that allow us
 to schedule spark jobs on a datastax cassandra cluster. Since there 
 are
 tonnes of alternatives out there like Ozzie, Azkaban, Luigi , Chronos 
 etc,
 I wanted to check with people here to see what they are using today.

 Some of the requirements of the workflow engine that I'm looking
 for are

 1. First class support for submitting Spark jobs on Cassandra. Not
 some wrapper Java code to submit tasks.
 2. Active open source community support and well tested at
 production scale.
 3. Should be dead easy to write job dependencices using XML or web
 interface . Ex; job A depends on Job B and Job C, so run Job A after 
 B and
 C are finished. Don't need to write full blown java applications to 
 specify
 job parameters and dependencies. Should be very simple to use.
 4. Time based  recurrent scheduling. Run the spark jobs at a given
 time every hour or day or week or month.
 5. Job monitoring, alerting on failures and email notifications on
 daily basis.

 I have looked at Ooyala's spark job server which seems to be hated
 towards making spark jobs run faster by sharing contexts between the 
 jobs
 but isn't a full blown workflow engine per se. A combination of spark 
 job
 server and workflow engine would be ideal

 Thanks for the inputs











Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Ruslan Dautkhanov
There was a similar problem reported on this list before.

Weird Python errors like this generally mean you have different
versions of Python on the nodes of your cluster. Can you check that?

From the error stack, the driver uses Python 2.7.10 (Anaconda 2.3.0),
while the OS/CDH version of Python on the worker nodes is probably 2.6.
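A quick way to compare them from the same notebook (a rough sketch; if this
also fails with a similarly odd ImportError, that by itself points to a
driver/worker version mismatch):

import sys
print sys.version                      # Python on the driver

# distinct Python versions actually used by the executors
print (sc.parallelize(range(100), 10)
         .map(lambda _: __import__('sys').version)
         .distinct()
         .collect())

If they differ, pointing PYSPARK_PYTHON at the same interpreter on every
node usually fixes it.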



-- 
Ruslan Dautkhanov

On Mon, Aug 10, 2015 at 3:53 PM, YaoPau jonrgr...@gmail.com wrote:

 I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via
 iPython Notebook.  I'm getting collect() to work just fine, but take()
 errors.  (I'm having issues with collect() on other datasets ... but take()
 seems to break every time I run it.)

 My code is below.  Any thoughts?

  sc
 pyspark.context.SparkContext at 0x7ffbfa310f10
  sys.version
 '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC
 4.4.7 20120313 (Red Hat 4.4.7-1)]'
  hourly = sc.textFile('tester')
  hourly.collect()
 [u'a man',
  u'a plan',
  u'a canal',
  u'panama']
  hourly = sc.textFile('tester')
  hourly.take(2)
 ---
 Py4JJavaError Traceback (most recent call last)
 ipython-input-15-1feecba5868b in module()
   1 hourly = sc.textFile('tester')
  2 hourly.take(2)

 /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in take(self,
 num)
1223
1224 p = range(partsScanned, min(partsScanned +
 numPartsToTry, totalParts))
 - 1225 res = self.context.runJob(self, takeUpToNumLeft, p,
 True)
1226
1227 items += res

 /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in
 runJob(self, rdd, partitionFunc, partitions, allowLocal)
 841 # SparkContext#runJob.
 842 mappedRDD = rdd.mapPartitions(partitionFunc)
 -- 843 it = self._jvm.PythonRDD.runJob(self._jsc.sc(),
 mappedRDD._jrdd, javaPartitions, allowLocal)
 844 return list(mappedRDD._collect_iterator_through_file(it))
 845


 /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer,
 self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:


 /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(

 Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.runJob.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 10.0 (TID 47, dhd490101.autotrader.com):
 org.apache.spark.api.python.PythonException: Traceback (most recent call
 last):
   File

 /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py,
 line 101, in main
 process()
   File

 /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py,
 line 96, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File

 /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/serializers.py,
 line 236, in dump_stream
 vs = list(itertools.islice(iterator, batch))
   File /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py, line
 1220, in takeUpToNumLeft
 while taken  left:
 ImportError: No module named iter

 at
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
 at
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:176)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203

Re: Spark Number of Partitions Recommendations

2015-08-01 Thread Ruslan Dautkhanov
You should also take into account the amount of memory you plan to use.
It's advised not to give too much memory to each executor; otherwise GC
overhead will go up.

Btw, why prime numbers?



-- 
Ruslan Dautkhanov

On Wed, Jul 29, 2015 at 3:31 AM, ponkin alexey.pon...@ya.ru wrote:

 Hi Rahul,

 Where did you see such a recommendation?
 I personally define partitions with the following formula

 partitions = nextPrimeNumberAbove( K*(--num-executors * --executor-cores )
 )

 where
 nextPrimeNumberAbove(x) - prime number which is greater than x
 K - multiplicator  to calculate start with 1 and encrease untill join
 perfomance start to degrade




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: TCP/IP speedup

2015-08-01 Thread Ruslan Dautkhanov
If your network is bandwidth-bound, setting jumbo frames (MTU 9000)
may increase bandwidth by up to ~20%.

http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm
"Enabling Jumbo Frames across the cluster improves bandwidth"

If the Spark workload is not network bandwidth-bound, I'd expect anywhere
from a few percent improvement to none at all.
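To check or experiment on a single node (eth0 is just a placeholder; the
switches and every host on the path must be configured for jumbo frames too):

ip link show eth0 | grep mtu          # check the current MTU
sudo ip link set dev eth0 mtu 9000    # enable jumbo frames (not persistent)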



-- 
Ruslan Dautkhanov

On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus edel...@gmail.com wrote:

 H

 2% huh.


 -- ttfn
 Simon Edelhaus
 California 2015

 On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 https://spark-summit.org/2015/events/making-sense-of-spark-performance/

 On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus edel...@gmail.com wrote:

 Hi All!

 How important would be a significant performance improvement to TCP/IP
 itself, in terms of
 overall job performance improvement. Which part would be most
 significantly accelerated?
 Would it be HDFS?

 -- ttfn
 Simon Edelhaus
 California 2015






Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Ruslan Dautkhanov
 We want these use actions respond within 2 to 5 seconds.

I think this goal is a stretch for Spark. Some queries may run faster than
that on a large dataset, but in general you can't put an SLA like this on it.
For example, if you have to join some huge datasets, you'll likely be well
over that. Spark is great for huge jobs, and it'll be much faster than MR.
I don't think Spark was designed with interactive queries in mind. For
example, although Spark is in-memory, its in-memory caching is scoped to a
single job/application. It's not like traditional RDBMS systems, where you
have a persistent buffer cache or in-memory columnar storage (both are
Oracle terms).
If you have multiple users running interactive BI queries, results that were
cached for the first user wouldn't be reused for the second user, unless you
invent something that keeps a persistent Spark context, serves users'
requests, and decides which RDDs to cache, when and how.
At least that's my understanding of how Spark works. If I'm wrong, I will be
glad to hear it, as we ran into the same questions.

As we use Cloudera's CDH, I'm not sure where Hortonworks are with their Tez
project, but Tez has components that are closer to the buffer cache or
in-memory columnar caching of traditional RDBMS systems, and it may get
better and/or more predictable performance on BI queries.



-- 
Ruslan Dautkhanov

On Mon, Jul 20, 2015 at 6:04 PM, renga.kannan renga.kan...@gmail.com
wrote:

 All,
 I really appreciate anyone's input on this. We are having a very simple
 traditional OLAP query processing use case. Our use case is as follows.


 1. We have a customer sales order table data coming from RDBMs table.
 2. There are many dimension columns in the sales order table. For each of
 those dimensions, we have individual dimension tables that stores the
 dimension record sets.
 3. We also have some BI like hierarchies that is defined for dimension data
 set.

 What we want for business users is as follows.?

 1. We wanted to show some aggregated values from sales Order transaction
 table columns.
 2. User would like to filter these with specific dimension values from
 dimension table.
 3. User should be able to drill down from higher level to lower level by
 traversing hierarchy on dimension


 We want these use actions respond within 2 to 5 seconds.


 We are thinking about using SPARK as our backend enginee to sever data to
 these front end application.


 Has anyone tried using SPARK for these kind of use cases. These are all
 traditional use cases in BI space. If so, can SPARK respond to these
 queries
 with in 2 to 5 seconds for large data sets.

 Thanks,
 Renga



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-SPARK-is-the-right-choice-for-traditional-OLAP-query-processing-tp23921.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-23 Thread Ruslan Dautkhanov
Or Spark on HBase )

http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/



-- 
Ruslan Dautkhanov

On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. that is, key-value stores

 Please consider HBase for this purpose :-)

 On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das t...@databricks.com
 wrote:

 I do not recommend using IndexRDD for state management in Spark
 Streaming. What it does not solve out-of-the-box is checkpointing of
 indexRDDs, which important because long running streaming jobs can lead to
 infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey
 operation which you can use, which gives state management capabilities.
 Though for most flexible arbitrary look up of stuff, its better to use a
 dedicated system that is designed and optimized for long term storage of
 data, that is, key-value stores, databases, etc.

 On Tue, Jul 14, 2015 at 5:44 PM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at SPARK-2365 which is in progress.

 On Tue, Jul 14, 2015 at 5:18 PM, swetha swethakasire...@gmail.com
 wrote:

 Hi,

 Is IndexedRDD available in Spark 1.4.0? We would like to use this in
 Spark
 Streaming to do lookups/updates/deletes in RDDs using keys by storing
 them
 as key/value pairs.

 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD-available-in-Spark-1-4-0-tp23841.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Spark equivalent for Oracle's analytical functions

2015-07-12 Thread Ruslan Dautkhanov
Window functions should be part of Spark 1.4:
https://issues.apache.org/jira/browse/SPARK-1442

I don't see them in the documentation though:
https://spark.apache.org/docs/latest/sql-programming-guide.html
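If I read SPARK-1442 correctly, the query from the quoted post below should
work on Spark 1.4 through a HiveContext, roughly like this (untested sketch):

hiveContext.sql(
  "SELECT EID, DEPT, count(1) OVER (PARTITION BY DEPT) AS CNT FROM T")

// or via the new DataFrame window API
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

hiveContext.table("T").select(col("EID"), col("DEPT"),
  count(lit(1)).over(Window.partitionBy(col("DEPT"))).as("CNT"))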


-- 
Ruslan Dautkhanov

On Mon, Jul 6, 2015 at 5:06 AM, gireeshp gireesh.puthum...@augmentiq.in
wrote:

 Is there any equivalent of Oracle's *analytical functions* in Spark SQL.

 For example, if I have following data set (say table T):
 /EID|DEPT
 101|COMP
 102|COMP
 103|COMP
 104|MARK/

 In Oracle, I can do something like
 /select EID, DEPT, count(1) over (partition by DEPT) CNT from T;/

 to get:
 /EID|DEPT|CNT
 101|COMP|3
 102|COMP|3
 103|COMP|3
 104|MARK|1/

 Can we do an equivalent query in Spark-SQL? Or what is the best method to
 get such results in Spark dataframes?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-equivalent-for-Oracle-s-analytical-functions-tp23646.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Ruslan Dautkhanov
 the executor receives a SIGTERM (from whom???)

From the YARN ResourceManager.

Check whether YARN Fair Scheduler preemption and/or speculative execution
are turned on; if so, this is quite possible and not a bug.
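Roughly, the knobs to look at (names from memory, please verify against
your YARN and Spark configs):

yarn.scheduler.fair.preemption    # yarn-site.xml, Fair Scheduler preemption
spark.speculation                 # spark-defaults.conf or --conf, speculative tasks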



-- 
Ruslan Dautkhanov

On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim jongw...@nyu.edu wrote:

 Based on my experience, YARN containers can get SIGTERM when

 - it produces too much logs and use up the hard drive
 - it uses off-heap memory more than what is given by
 spark.yarn.executor.memoryOverhead configuration. It might be due to too
 many classes loaded (less than MaxPermGen but more than memoryOverhead), or
 some other off-heap memory allocated by networking library, etc.
 - it opens too many file descriptors, which you can check on the executor
 node's /proc/executor jvm's pid/fd/

 Does any of these apply to your situation?

 Jong Wook

 On Jul 7, 2015, at 19:16, Kostas Kougios kostas.koug...@googlemail.com
 wrote:

 I am still receiving these weird sigterms on the executors. The driver
 claims
 it lost the executor, the executor receives a SIGTERM (from whom???)

 It doesn't seem a memory related issue though increasing memory takes the
 job a bit further or completes it. But why? there is no memory pressure on
 neither driver nor executor. And nothing in the logs indicating so.

 driver:

 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in
 stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in
 stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
 or disconnected! Shutting down. cruncher05.stratified:32976
 15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
 cruncher05.stratified: remote Rpc client disassociated
 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
 from TaskSet 0.0
 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
 or disconnected! Shutting down. cruncher05.stratified:32976
 15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with
 remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].

 15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage
 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
 lost)

 gc log for driver, it doesnt look like it run outofmem:

 2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
 1764131K-1391211K(3393024K), 0.0102839 secs]
 2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
 1764971K-1391867K(3405312K), 0.0099062 secs]
 2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
 1782011K-1392596K(3401216K), 0.0167572 secs]

 executor:

 15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
 (TID 14750)
 15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
 found, computing it
 15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
 SIGNAL 15: SIGTERM
 15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

 executor gc log (no outofmem as it seems):
 2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
 24696750K-23712939K(33523712K), 0.0416640 secs]
 2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
 24700520K-23722043K(33523712K), 0.0391156 secs]
 2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
 24709182K-23726510K(33518592K), 0.0390784 secs]





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread Ruslan Dautkhanov
You can see what Spark SQL functions are supported in Spark by doing the
following in a notebook:
%sql show functions

https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html

I think Spark SQL support is currently around Hive ~0.11?
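Outside a notebook, roughly the same check should work through a HiveContext
(untested sketch):

hiveContext.sql("SHOW FUNCTIONS").collect().foreach(println)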



-- 
Ruslan Dautkhanov

On Tue, Jul 7, 2015 at 3:10 PM, Jeff J Li l...@us.ibm.com wrote:

 I am trying to use the posexplode function in the HiveContext to
 auto-generate a sequence number. This feature is supposed to be available
 Hive 0.13.0.

 SELECT name, phone FROM contact LATERAL VIEW
 posexplode(phoneList.phoneNumber) phoneTable AS pos, phone

 My test program failed with the following

 java.lang.ClassNotFoundException: posexplode
 at java.net.URLClassLoader.findClass(URLClassLoader.java:665)
 at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:942)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:851)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:827)
 at
 org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:147)
 at
 org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274)
 at
 org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)

 Does spark support this Hive function posexplode? If not, how to patch it
 to support this? I am on Spark 1.3.1

 Thanks,
 Jeff Li





Re: Caching in spark

2015-07-12 Thread Ruslan Dautkhanov
Hi Akhil,

I'm curious whether RDDs are stored internally in a columnar format as well,
or whether data is converted to the columnar format only when it is cached
via the SQL context.
What about DataFrames?

Thanks!
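ps. For reference, the SQL-level caching described in that link boils down to
something like this (table name is made up):

sqlContext.sql("CACHE TABLE my_table")     // or: sqlContext.cacheTable("my_table")
sqlContext.sql("UNCACHE TABLE my_table")   // or: sqlContext.uncacheTable("my_table")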


-- 
Ruslan Dautkhanov

On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:


 https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory

 Thanks
 Best Regards

 On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar vinodsachin...@gmail.com
 wrote:

 Hi Guys,

 Can any one please share me how to use caching feature of spark via spark
 sql queries?

 -Vinod





Re: .NET on Apache Spark?

2015-07-05 Thread Ruslan Dautkhanov
Scala used to run on .NET
http://www.scala-lang.org/old/node/10299


-- 
Ruslan Dautkhanov

On Thu, Jul 2, 2015 at 1:26 PM, pedro ski.rodrig...@gmail.com wrote:

 You might try using .pipe() and installing your .NET program as a binary
 across the cluster (or using addFile). Its not ideal to pipe things in/out
 along with the overhead, but it would work.

 I don't know much about IronPython, but perhaps changing the default python
 by changing your path might work?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NET-on-Apache-Spark-tp23578p23594.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: configuring max sum of cores and memory in cluster through command line

2015-07-05 Thread Ruslan Dautkhanov
It's not possible to specify YARN ResourceManager parameters on the
spark-submit command line. You have to tell YARN upfront how many resources
are available on your cluster. If you want to limit the amount of resources
available to your Spark job, consider using YARN dynamic resource pools
instead:

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_resource_pools.html
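What you can control per job at spark-submit time is how many resources
Spark itself requests from YARN, e.g. (all values are placeholders):

spark-submit \
  --master yarn-cluster \
  --num-executors 10 \
  --executor-cores 3 \
  --executor-memory 5g \
  --conf spark.yarn.executor.memoryOverhead=512 \
  your_app.jar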




-- 
Ruslan Dautkhanov

On Thu, Jul 2, 2015 at 4:20 PM, Alexander Waldin awal...@inflection.com
wrote:

  Hi,

 I'd like to specify the total sum of cores / memory as command line
 arguments with spark-submit. That is, I'd like to set
 yarn.nodemanager.resource.memory-mb and the
 yarn.nodemanager.resource.cpu-vcores parameters as described in this blog
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
 post.

 when submitting through the command line, what is the correct way to do
 it? Is it:

 --conf spark.yarn.nodemanager.resource.memory-mb=54g
 --conf spark.yarn.nodemanager.resource.cpu-vcores=31

 or

 --conf yarn.nodemanager.resource.memory-mb=54g
 --conf yarn.nodemanager.resource.cpu-vcores=31


 or something else? I tried these, and I tried looking in the
 ResourceManager UI to see if they were set, but couldn't find them.

 Thanks!

 Alexander



Re: Problem after enabling Hadoop native libraries

2015-06-30 Thread Ruslan Dautkhanov
You can run
hadoop checknative -a
and see if bzip2 is detected correctly.


-- 
Ruslan Dautkhanov

On Fri, Jun 26, 2015 at 10:18 AM, Marcelo Vanzin van...@cloudera.com
wrote:

 What master are you using? If this is not a local master, you'll need to
 set LD_LIBRARY_PATH on the executors also (using
 spark.executor.extraLibraryPath).

 If you are using local, then I don't know what's going on.
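
A sketch of that suggestion from PySpark; the native-library path below is a
placeholder for wherever the Hadoop native libs actually live on the cluster:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("bzip2-seqfile")
            # Point the executors at the Hadoop native libraries (placeholder path);
            # the driver side can keep using LD_LIBRARY_PATH as in the command above.
            .set("spark.executor.extraLibraryPath", "/opt/hadoop/lib/native"))
    sc = SparkContext(conf=conf)

    seq = sc.sequenceFile("hdfs:///data/some-bzip2-seqfile")   # hypothetical path
    print(seq.take(1))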

 On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh arunabha...@gmail.com
 wrote:

 Hi,
  I'm having trouble reading Bzip2 compressed sequence files after I
 enabled hadoop native libraries in spark.

 Running
 LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/ $SPARK_HOME/bin/spark-submit
 --class  gives the following error

 5/06/26 00:48:02 INFO CodecPool: Got brand-new decompressor [.bz2]
 15/06/26 00:48:02 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID
 3)
 java.lang.UnsupportedOperationException
  at org.apache.hadoop.io.compress.bzip2.BZip2DummyDecompressor.decompress(BZip2DummyDecompressor.java:32)
  at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
  at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
 at java.io.DataInputStream.readFully(DataInputStream.java:195)
 at java.io.DataInputStream.readLong(DataInputStream.java:416)

 removing the LD_LIBRARY_PATH makes spark run fine but it gives the
 following warning
 WARN NativeCodeLoader: Unable to load native-hadoop library for your
 platform... using builtin-java classes where applicable

 Has anyone else run into this issue ? Any help is welcome.




 --
 Marcelo



Re: flume sinks supported by spark streaming

2015-06-23 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/streaming-flume-integration.html

Yep, avro sink is the correct one.



-- 
Ruslan Dautkhanov
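
For reference, a minimal PySpark sketch of the push-based (Avro sink) approach.
The host/port must match the Flume agent's avro sink configuration, the
spark-streaming-flume artifact must be on the classpath, and the Python API for
this appeared around Spark 1.5:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.flume import FlumeUtils

    sc = SparkContext(appName="flume-avro-sketch")
    ssc = StreamingContext(sc, 10)   # 10-second batches

    # Each record arrives as a (headers, body) pair.
    stream = FlumeUtils.createStream(ssc, "localhost", 41414)   # hypothetical host/port
    stream.map(lambda event: event[1]).pprint()

    ssc.start()
    ssc.awaitTermination()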

On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:

 Hi!


 I want to integrate Flume with Spark Streaming. Which Flume sink types are
 supported by Spark Streaming? I saw one example using avroSink.

 Thanks





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/flume-sinks-supported-by-spark-streaming-tp23462.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: [ERROR] Insufficient Space

2015-06-19 Thread Ruslan Dautkhanov
Vadim,

You could edit /etc/fstab, then issue mount -o remount to give more shared
memory online.

Didn't know Spark uses shared memory.

Hope this helps.

On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy vadim.bichuts...@gmail.com
wrote:

 Hello Spark Experts,

 I've been running a standalone Spark cluster on EC2 for a few months now,
 and today I get this error:

 IOError: [Errno 28] No space left on device
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 OpenJDK 64-Bit Server VM warning: Insufficient space for shared memory
 file

 I guess I need to resize the cluster. What's the best way to do that?

 Thanks,
 Vadim



Re: Does MLLib has attribute importance?

2015-06-18 Thread Ruslan Dautkhanov
Got it. Thanks!



-- 
Ruslan Dautkhanov

On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng men...@gmail.com wrote:

 ChiSqSelector takes an RDD of labeled points, where the label is the
 target. See
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L120
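
For illustration, a minimal PySpark sketch of that usage with toy data
(ChiSqSelector is in pyspark.mllib.feature from around Spark 1.4):

    from pyspark import SparkContext
    from pyspark.mllib.feature import ChiSqSelector
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="chisq-sketch")
    # The label is the target; the features here are small categorical-style values.
    data = sc.parallelize([
        LabeledPoint(0.0, Vectors.dense([0.0, 1.0, 3.0])),
        LabeledPoint(1.0, Vectors.dense([1.0, 0.0, 2.0])),
        LabeledPoint(1.0, Vectors.dense([1.0, 1.0, 1.0])),
    ])
    selector = ChiSqSelector(numTopFeatures=2)
    model = selector.fit(data)   # ranks features by chi-squared against the label
    reduced = model.transform(data.map(lambda lp: lp.features))
    print(reduced.collect())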

 On Wed, Jun 17, 2015 at 10:22 PM, Ruslan Dautkhanov
 dautkha...@gmail.com wrote:
  Thank you Xiangrui.
 
  Oracle's attribute importance mining function have a target variable.
  Attribute importance is a supervised function that ranks attributes
  according to their significance in predicting a target.
  MLlib's ChiSqSelector does not have a target variable.
 
 
 
 
  --
  Ruslan Dautkhanov
 
  On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng men...@gmail.com wrote:
 
  We don't have it in MLlib. The closest would be the ChiSqSelector,
  which works for categorical data. -Xiangrui
 
  On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov 
 dautkha...@gmail.com
  wrote:
   What would be closest equivalent in MLLib to Oracle Data Miner's
   Attribute
   Importance mining function?
  
  
  
 http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920
  
   Attribute importance is a supervised function that ranks attributes
   according to their significance in predicting a target.
  
  
   Best regards,
   Ruslan Dautkhanov
 
 



Re: Does MLLib has attribute importance?

2015-06-17 Thread Ruslan Dautkhanov
Thank you Xiangrui.

Oracle's attribute importance mining function has a target variable.
Attribute importance is a supervised function that ranks attributes
according to their significance in predicting a target.
MLlib's ChiSqSelector does not have a target variable.




-- 
Ruslan Dautkhanov

On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng men...@gmail.com wrote:

 We don't have it in MLlib. The closest would be the ChiSqSelector,
 which works for categorical data. -Xiangrui

 On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov dautkha...@gmail.com
 wrote:
  What would be closest equivalent in MLLib to Oracle Data Miner's
 Attribute
  Importance mining function?
 
 
 http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920
 
  Attribute importance is a supervised function that ranks attributes
  according to their significance in predicting a target.
 
 
  Best regards,
  Ruslan Dautkhanov



Does MLLib has attribute importance?

2015-06-11 Thread Ruslan Dautkhanov
What would be the closest equivalent in MLlib to Oracle Data Miner's Attribute
Importance mining function?

http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920

Attribute importance is a supervised function that ranks attributes
according to their significance in predicting a target.


Best regards,
Ruslan Dautkhanov


k-means for text mining in a streaming context

2015-06-08 Thread Ruslan Dautkhanov
Hello,

https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Would Feature Extraction and Transformation work in a streaming context?

I want to extract text features and build k-means clusters in a streaming
context to detect anomalies on a continuous text stream.

Would it be possible?


Best regards,
Ruslan Dautkhanov
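
As a rough sketch of the idea in PySpark: hash each incoming text line into a
term-frequency vector and feed a streaming k-means model. The socket source,
feature size, and k are illustrative, and StreamingKMeans only appeared in the
Python API around Spark 1.5; flagging anomalies would still need a separate
distance-to-centroid check.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.clustering import StreamingKMeans

    sc = SparkContext(appName="streaming-text-kmeans")
    ssc = StreamingContext(sc, 10)

    hashingTF = HashingTF(numFeatures=1000)
    lines = ssc.socketTextStream("localhost", 9999)          # hypothetical text source
    vectors = lines.map(lambda line: hashingTF.transform(line.split()))

    model = (StreamingKMeans(k=5, decayFactor=1.0)
             .setRandomCenters(1000, 1.0, 42))   # dim, weight, seed
    model.trainOn(vectors)                # update cluster centers on each batch
    model.predictOn(vectors).pprint()     # emits a cluster id per input vector

    ssc.start()
    ssc.awaitTermination()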


Re: Spark Job always cause a node to reboot

2015-06-04 Thread Ruslan Dautkhanov
vm.swappiness=0? Some vendors recommend setting this to 0 (zero), although I've
seen that cause even the kernel to fail to allocate memory, which may lead to a
node reboot. If that's the case, set vm.swappiness to 5-10 and decrease
spark.*.memory. Your spark.driver.memory + spark.executor.memory + OS overhead,
etc. must stay below the amount of memory the node has.



-- 
Ruslan Dautkhanov

On Thu, Jun 4, 2015 at 8:59 AM, Chao Chen kandy...@gmail.com wrote:

 Hi all,
 I am new to spark. I am trying to deploy HDFS (hadoop-2.6.0) and
 Spark-1.3.1 with four nodes, and each node has 8-cores and 8GB memory.
 One is configured as headnode running masters, and 3 others are workers

 But when I try to run the PageRank workload from HiBench, it always causes a
 node to reboot in the middle of the job for the Scala, Java, and Python
 versions, but it works fine with the MapReduce version of the same benchmark.

 I also tried standalone deployment, got the same issue.

 My spark-defaults.conf
 spark.master            yarn-client
 spark.driver.memory 4g
 spark.executor.memory   4g
 spark.rdd.compress  false


 The job submit script is:

 bin/spark-submit  --properties-file
 HiBench/report/pagerank/spark/scala/conf/sparkbench/spark.conf --class
 org.apache.spark.examples.SparkPageRank --master yarn-client
 --num-executors 2 --executor-cores 4 --executor-memory 4G --driver-memory
 4G
 HiBench/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar
 hdfs://discfarm:9000/HiBench/Pagerank/Input/edges
 hdfs://discfarm:9000/HiBench/Pagerank/Output 3

 What is the problem with my configuration, and how can I find the cause?

 any help is welcome !















Re: How to monitor Spark Streaming from Kafka?

2015-06-02 Thread Ruslan Dautkhanov
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf




-- 
Ruslan Dautkhanov

On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:

 Thank you, Tathagata, Cody, Otis.

 - Dmitry


 On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

 I think you can use SPM - http://sematext.com/spm - it will give you all
 Spark and all Kafka metrics, including offsets broken down by topic, etc.
 out of the box.  I see more and more people using it to monitor various
 components in data processing pipelines, a la
 http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/

 Otis

 On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Hi,

 What are some of the good/adopted approaches to monitoring Spark
 Streaming
 from Kafka?  I see that there are things like
 http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they
 all
 assume that Receiver-based streaming is used?

 Then there's the note that one disadvantage of this approach (Receiverless Approach,
 #2) is that it does not update offsets in Zookeeper, hence
 Zookeeper-based
 Kafka monitoring tools will not show progress. However, you can access
 the
 offsets processed by this approach in each batch and update Zookeeper
 yourself.

 The code sample, however, seems sparse. What do you need to do here? -
   directKafkaStream.foreachRDD(
       new Function<JavaPairRDD<String, String>, Void>() {
           @Override
           public Void call(JavaPairRDD<String, String> rdd) throws IOException {
               OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
               // offsetRanges.length = # of Kafka partitions being consumed
               ...
               return null;
           }
       }
   );

 and if these are updated, will KafkaOffsetMonitor work?

 Monitoring seems to center around the notion of a consumer group.  But in
 the receiverless approach, code on the Spark consumer side doesn't seem
 to
 expose a consumer group parameter.  Where does it go?  Can I/should I
 just
 pass in group.id as part of the kafkaParams HashMap?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Ruslan Dautkhanov
It's not only about cores. Keep in mind spark.executor.cores also affects
the memory available to each task:

From
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

The memory available to each task is (spark.executor.memory *
spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) /
spark.executor.cores. Memory fraction and safety fraction default to 0.2
and 0.8 respectively.

I'd test spark.executor.cores with 2, 4, 8, and 16 and see what makes your job
run faster.


-- 
Ruslan Dautkhanov
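
To make that formula concrete, a quick back-of-the-envelope sketch in Python
(the 4g executor size is just an example):

    executor_memory_gb = 4.0        # spark.executor.memory
    memory_fraction = 0.2           # spark.shuffle.memoryFraction default
    safety_fraction = 0.8           # spark.shuffle.safetyFraction default

    for cores in (2, 4, 8, 16):     # candidate spark.executor.cores values
        per_task_gb = executor_memory_gb * memory_fraction * safety_fraction / cores
        print("%2d cores -> %.2f GB of shuffle memory per task" % (cores, per_task_gb))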

On Wed, May 27, 2015 at 6:46 PM, Mulugeta Mammo mulugeta.abe...@gmail.com
wrote:

 My executor has the following spec (lscpu):

 CPU(s): 16
 Core(s) per socket: 4
 Socket(s): 2
 Thread(s) per core: 2

 The CPU count is obviously 4*2*2 = 16. My question is what value is Spark
 expecting in SPARK_EXECUTOR_CORES ? The CPU count (16) or total # of cores
 (2 * 2 = 4) ?

 Thanks



Re: PySpark Logs location

2015-05-21 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application

When log aggregation isn’t turned on, logs are retained locally on each
machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or
$HADOOP_HOME/logs/userlogs depending on the Hadoop version and
installation. Viewing logs for a container requires going to the host that
contains them and looking in this directory. Subdirectories organize log
files by application ID and container ID.

You can enable log aggregation by changing yarn.log-aggregation-enable to
true so it'll be easier to see yarn application logs.

-- 
Ruslan Dautkhanov

On Thu, May 21, 2015 at 5:08 AM, Oleg Ruchovets oruchov...@gmail.com
wrote:

 It doesn't work for me so far. I used the command but got the output below.
 What should I check to fix the issue? Any configuration parameters ...


 [root@sdo-hdp-bd-master1 ~]# yarn logs -applicationId
 application_1426424283508_0048
 15/05/21 13:25:09 INFO impl.TimelineClientImpl: Timeline service address:
 http://hdp-bd-node1.development.c4i:8188/ws/v1/timeline/
 15/05/21 13:25:09 INFO client.RMProxy: Connecting to ResourceManager at
 hdp-bd-node1.development.c4i/12.23.45.253:8050
 /app-logs/root/logs/application_1426424283508_0048 does not exist.
 *Log aggregation has not completed or is not enabled.*

 Thanks
 Oleg.

 On Wed, May 20, 2015 at 11:33 PM, Ruslan Dautkhanov dautkha...@gmail.com
 wrote:

 Oleg,

 You can see applicationId in your Spark History Server.
 Go to http://historyserver:18088/

 Also check
 https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application

 It should be no different with PySpark.


 --
 Ruslan Dautkhanov

 On Wed, May 20, 2015 at 2:12 PM, Oleg Ruchovets oruchov...@gmail.com
 wrote:

 Hi Ruslan.
   Could you add more details please.
 Where do I get applicationId? In case I have a lot of log files would it
 make sense to view it from single point.
 How actually I can configure / manage log location of PySpark?

 Thanks
 Oleg.

 On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov 
 dautkha...@gmail.com wrote:

 You could use

 yarn logs -applicationId application_1383601692319_0008



 --
 Ruslan Dautkhanov

 On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets oruchov...@gmail.com
 wrote:

 Hi ,

   I am executing PySpark job on yarn ( hortonworks distribution).

 Could someone point me to the log locations?

 Thanks
 Oleg.








Re: PySpark Logs location

2015-05-20 Thread Ruslan Dautkhanov
Oleg,

You can see applicationId in your Spark History Server.
Go to http://historyserver:18088/

Also check
https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application

It should be no different with PySpark.


-- 
Ruslan Dautkhanov

On Wed, May 20, 2015 at 2:12 PM, Oleg Ruchovets oruchov...@gmail.com
wrote:

 Hi Ruslan.
   Could you add more details please.
 Where do I get applicationId? In case I have a lot of log files would it
 make sense to view it from single point.
 How actually I can configure / manage log location of PySpark?

 Thanks
 Oleg.

 On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov dautkha...@gmail.com
 wrote:

 You could use

 yarn logs -applicationId application_1383601692319_0008



 --
 Ruslan Dautkhanov

 On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets oruchov...@gmail.com
 wrote:

 Hi ,

   I am executing PySpark job on yarn ( hortonworks distribution).

 Could someone point me to the log locations?

 Thanks
 Oleg.






Re: PySpark Logs location

2015-05-20 Thread Ruslan Dautkhanov
You could use

yarn logs -applicationId application_1383601692319_0008



-- 
Ruslan Dautkhanov

On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets oruchov...@gmail.com
wrote:

 Hi ,

   I am executing PySpark job on yarn ( hortonworks distribution).

 Could someone point me to the log locations?

 Thanks
 Oleg.



Re: Reading Nested Fields in DataFrames

2015-05-11 Thread Ruslan Dautkhanov
Had the same question on stackoverflow recently
http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark


Lomig Mégard gave a detailed answer on how to do this without using LATERAL
VIEW.


On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh ashish23...@gmail.com
wrote:

 Hi ,
 I am trying to read nested Avro data in Spark 1.3 using DataFrames.
 I need help retrieving the inner element data in the structure below.

 Below is the schema when I enter df.printSchema :

  |-- THROTTLING_PERCENTAGE: double (nullable = false)
  |-- IMPRESSION_TYPE: string (nullable = false)
  |-- campaignArray: array (nullable = false)
  ||-- element: struct (containsNull = false)
  |||-- COOKIE: string (nullable = false)
  |||-- CAMPAIGN_ID: long (nullable = false)


 How can I access the CAMPAIGN_ID field in this schema?

 Thanks,
 Ashish Kr. Singh
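
For what it's worth, a short PySpark sketch of one way to get at the nested
field. It assumes df is the DataFrame whose schema is printed above and uses
explode(), which is available in the DataFrame API from Spark 1.4; on 1.3 the
LATERAL VIEW route, or the approach in the Stack Overflow answer above, would
be needed instead.

    from pyspark.sql.functions import explode

    # df is assumed to be the DataFrame with the campaignArray column shown above.
    campaigns = (df
                 .select(explode(df.campaignArray).alias("campaign"))
                 .select("campaign.CAMPAIGN_ID", "campaign.COOKIE"))
    campaigns.show()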